SEMI-ANNUAL

TECHNICAL R&D STATUS REPORT
OCTOBER 1983 - APRIL 1984


R.W. BRODERSEN
N. CHEUNG
A. DESPAIN
D.A. HODGES
C. HU
R. KATZ
P. KO
D. MESSERSCHMITT
R.S. MULLER
A. NEUREUTHER
A.R. NEWTON
W.G. OLDHAM
J. OUSTERHOUT
D.A. PATTERSON
A. SANGIOVANNI-VINCENTELLI
C. SEQUIN


SPONSORED  BY 


ADVANCED RESEARCH PROJECTS AGENCY
ARPA ORDER NO. 40.U


MONITORED BY
NAVAL ELECTRONIC SYSTEMS COMMAND
UNDER CONTRACT NO. N00039-C-0107


This document has been approved for public release; its distribution is unlimited.


ELECTRONICS  RESEARCH  LABORATORY 

College  of  Engineering 

University  of  California,  Berkeley,  CA  94720 


REPORT DOCUMENTATION PAGE

4. TITLE (and Subtitle)
   VLSI Research

5. TYPE OF REPORT & PERIOD COVERED
   Technical R&D Status Report
   October 1983 - April 1984

7. AUTHOR(S)
   R.W. Brodersen, et al.

8. CONTRACT OR GRANT NUMBER(S)
   N00039-C-0107

9. PERFORMING ORGANIZATION NAME AND ADDRESS
   University of California, Berkeley
   404 Cory Hall
   Berkeley, CA 94720

13. NUMBER OF PAGES
   1,500 (approximate)

14. MONITORING AGENCY NAME & ADDRESS
   Department of Navy
   Naval Electronic Systems Command
   Washington, D.C. 20363

16. DISTRIBUTION STATEMENT (of this Report)
   Approved For Public Release; Distribution Unlimited

DTIC ELECTE JUN 12 1984



VLSI  RESEARCH 


SEMI-ANNUAL 

TECHNICAL  R&D  STATUS  REPORT 
OCTOBER  1983  -  APRIL  1984 


PRINCIPAL  INVESTIGATOR 

R.W. BRODERSEN
(415-542-1779) 


FACULTY  RESEARCHERS 


R.W. BRODERSEN
N.  CHEUNG 
A.  DESPAIN 
D.A.  HODGES 

C.  HU 
R.  KATZ 
P.  KO 

D.  MESSERSCHMITT 
R.S.  MULLER 

A.  NEUREUTHER 

A.R.  NEWTON 

W.G. OLDHAM

J.  OUSTERHOUT 

D.A.  PATTERSON 

A. SANGIOVANNI-VINCENTELLI

C.  SEQUIN 




TABLE  OF  CONTENTS 

I.  Executive  Overview 

II. Summary of Research

III. Publications

a.  Architecture 

b.  Computer  Aids  for  Design  and  Layout 

c.  Circuit  &  System  Design 

d.  Technology 




Executive  Overview 

This  report  covers  the  period  from  October  1983  to  April  1984  on  contract 
No.  N00039-C-0107.  A  few  of  the  highlights  of  this  report  follow. 

A  scaled  version  of  the  RISC  II  chip  has  been  fabricated  and  tested  and 
these  new  chips  have  a  cycle  time  that  would  outperform  a  VAX  11/780  by  about 
a  factor  of  two  on  compiled  integer  C  programs.  The  architectural  work  on  a 
RISC  chip  designed  for  a  Smalltalk  implementation  has  been  completed.  This 
chip,  called  SOAR  (Smalltalk  on  a  RISC),  should  run  programs  4-15  times  faster 
than  the  Xerox  1100  (Dolphin),  a  TTL  minicomputer,  and  about  as  fast  as  the 
Xerox  1132  (Dorado),  a  $100,000  ECL  minicomputer. 

The 1983 VLSI tools tape has been converted for use under the latest UNIX
release (4.2). The Magic (formerly called Caddy) layout system will be a unified
set of highly automated tools that cover all aspects of the layout process,
including stretching, compaction, tiling and routing. A multiple-window package
and design rule checker for this system have just been completed, and compaction
and stretching are partially implemented. New slope-based timing models for
the Crystal timing analyzer are now fully implemented and in regular use. In an
accuracy test using a dozen critical paths from the RISC II processor and cache
chips, Crystal's estimates were within 5-10% of SPICE's estimates, while Crystal
ran approximately 10,000 times faster.

A new approach to the state assignment problem, which allows minimization
of Finite State Machines, has been developed. This work, coupled with some new
advances in two-level logic minimization (ESPRESSO II), will provide the basis for
a highly effective tool for FSM synthesis. The SPLICE 1.7 program, which has been
shown to run more than two orders of magnitude faster than SPICE while giving
answers that are as accurate, is now in use at over 100 sites. The implementation
of this approach on a multiprocessor machine (the BBN Butterfly) has been
investigated, and a 70% efficiency for a 10-processor machine was found to be
realizable (i.e., 70% of the possible tenfold improvement over a uniprocessor was
achieved).

A design study of VLSI communications has been completed which discusses
fault tolerance, distribution of ports, routing and buffering schemes, and
self-testing. In addition, novel mappings of digital filter architectures onto
multiprocessor systems have been discovered which allow arbitrarily high
sampling rates with a fixed-speed technology.

A design frame for the Multibus system bus has been developed, and a frame
chip and printed circuit board have been integrated. This should provide "run
time support" and "system level services" for a chip designer in testing and
using a chip. The design frame chip and custom-designed printed circuit
board have been operating in a Sun workstation under the UNIX operating system.

A  1000  word  speech  recognition  board,  which  uses  two  special  purpose 
chips,  has  been  successfully  operated  inside  a  SUN  workstation.  The  UNIX 
drivers  which  allow  direct  interaction  between  the  SUN  cpu  and  the  cpu  on  the 
recognition  board  have  been  written.  The  design  of  the  board  has  been 
transferred  to  SRI,  where  one  duplicate  board  has  been  made  and  about  10  more 
are  planned  in  the  near  future. 

The  software  system  which  performs  the  complete  silicon  compilation  of 
digital  filter  banks  from  high  level  filter  descriptions  is  now  being  transferred 
into  industry.  The  complete  generation  of  a  20,000  transistor  circuit,  including 
real  time  testing  of  the  algorithm,  can  be  performed  within  one  day.  A  number 
of these circuits have been fabricated and tested and they have all been found
to  meet  specifications. 

A single-chip full duplex linear predictive coding (LPC) vocoder circuit has
been designed and tested. Some of the chips have been delivered to a company
(GE) for critical evaluation of the speech quality when configured into an
LPC-10 system.

A quantitative model for CMOS latch-up has been developed, which has led to
a new technique for suppressing it. The SIMPL-1 program has been completed; it
uses a file of process steps and the CIF layout information to generate a
"scanning electron microscope"-like view of the device topography. A series of
models aimed at the IC designer has been developed to explain the effects of
hot electrons in a scaled technology.


1. ARCHITECTURE


1.1. Rounding up the RISC Project (D. Patterson, C. Séquin)

A shrunk version of the RISC II chip, implemented by straightforward scaling
down of the mask geometry to a Lambda of 1.5 microns, has been tested and
evaluated. As expected, it was functionally correct since it came from the same
CIF file; the pleasant surprise was that it ran 50% faster than the previous
version with Lambda equal to 2 microns. No simulation had been done for the
new device geometry, and the fact that it performs so well is a tribute to the
ruggedness of the RISC circuit design. It also shows that, within limits,
straightforward Mead-Conway scaling is indeed practical. The new chips, running
at a 330 ns cycle time (VDD=5V, VBB=VSS=0V, room temperature), would
outperform a VAX 11/780 by about a factor of two on compiled integer C
programs. These results were presented at ISSCC last month [1].

This brings the RISC project to a close. Manolis Katevenis finished his PhD
in November 1983, and Robert Sherburne will finish his PhD within a month. The
two theses document the experience gained in the RISC project. Katevenis'
thesis starts from an analysis of the inner loops of typical programs to
determine the most frequently used operations, and derives the implications for
the selection of an instruction set and for the choice of features to be
supported in hardware on a single-chip RISC. From this starting point it derives
the RISC microarchitecture and discusses the associated trade-offs.

Sherburne's thesis focuses on the circuit design in RISC II. It emphasizes
that the circuit design has to be seen in the context of the microarchitectural
tradeoffs and can even influence the choice of the instruction set. In
particular, it discusses the optimum size of the register file. Beyond a certain
point, more registers are not necessarily better, since the longer busses will
slow down the overall machine cycle. A marginal increase in the hit ratio in
accessing scalar operands is not worth a slow-down of all instructions.

The optimal replacement strategy for swapping windows in the RISC register
file has been analyzed in a paper published in IEEE Transactions on Computers
[5].


1.2.  Architecture  for  Software  Prototyping  (D.  Patterson,  D.  Hodges) 

We  have  completed  the  initial  versions  of  the  architecture  simulator  and 
Smalltalk-80  compiler  for  SOAR  (Smalltalk  On  A  RISC),  our  software  prototyping 
architecture.  We  have  recently  run  8  small  Smalltalk-80  benchmarks  on  the 
simulator  and  found  promising  results.  A  800  ns  cycle  SOAR  will  run  these  small 
programs 4 to 15 times faster than the Xerox 1100 (Dolphin), a TTL
minicomputer, and at between 0.3 and 1.5 times the speed of the Xerox 1132
(Dorado), an ECL minicomputer costing over $100,000. Our next step in
performance analysis will be to complete the other 40 "micro" benchmarks, and
then run the dozen large Smalltalk benchmarks. These large benchmarks will give
accurate predictions of the performance of SOAR, but we are very encouraged by
these early estimates.

The  next  step  will  be  to  complete  the  layout,  simulation,  and  timing 
verification  to  estimate  accurately  the  SOAR  cycle  time. 


1.3. Novel, High-Performance Architectures (S. Baden, A. Despain)

We are studying the problems of multiprocessor system design, in particular
the partitioning and synchronization problems that arise when many processors
cooperate on dynamically changing computational structures. Such structures
occur in very difficult calculations that today are attempted only on the
largest machines.

We have begun evaluating benchmark programs that will be used to generate
trace tapes on a Cray-1. We will use these tapes to measure the dynamic
resource demands of the benchmarks (such as how many processes could be
executed in parallel if sufficient processing elements were available) in
order to refine the design of our multiprocessor architecture. Later, the tapes
will also be used in trace-driven simulations of our machine design.

At the moment we are looking at Adaptive Mesh Refinement (AMR), a method
for solving non-linear Partial Differential Equations (PDEs) in the presence of
strong shocks (and other local irregularities). AMR is attractive because it can
be used to achieve more accurate results while incurring minimal space and time
penalties. As it is representative of a much larger class of problems that
cannot be statically partitioned, it is of particular interest to us.

Basically, AMR works by refining the solution grid wherever the solution is
changing 'too' rapidly (as indicated by a Richardsonian error estimate) and by
operating on a reduced time scale in those regions. Thus, the solver is able to
distribute the computer's resources (memory words and processor cycles)
where they are needed the most, rather than distributing them uniformly over
the entire solution grid. Owing to the dynamic nature of this algorithm (refined
grids can shrink, grow, and disappear unpredictably), it is not possible to
partition an AMR solver at compile time. Thus, to avoid excessive serial
execution bottlenecks and communications delays, a scientific multiprocessor
must provide run time support for partitioning and load balancing activities.
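
The refinement rule described above can be illustrated with a minimal one-dimensional sketch. It is not an AMR PDE solver: as an assumption made purely for illustration, it applies the same idea to trapezoid-rule integration, comparing a coarse and a fine estimate (a Richardson-style error estimate) to decide which cells to split. The function names, the test profile and the tolerance are hypothetical and do not come from the report.

```python
import math


def richardson_error(f, a, b):
    """Richardson-style error estimate for the trapezoid rule on [a, b].

    Compares a one-panel and a two-panel estimate; for a second-order
    rule the error of the finer estimate is roughly |fine - coarse| / 3.
    """
    mid = 0.5 * (a + b)
    coarse = 0.5 * (b - a) * (f(a) + f(b))
    fine = 0.25 * (b - a) * (f(a) + 2.0 * f(mid) + f(b))
    return abs(fine - coarse) / 3.0


def refine(f, a, b, tol, max_depth=12):
    """Return the list of cells (lo, hi) left after adaptive refinement."""
    if max_depth == 0 or richardson_error(f, a, b) <= tol:
        return [(a, b)]
    mid = 0.5 * (a + b)
    # Split the cell and give each half a share of the error budget.
    return (refine(f, a, mid, tol / 2.0, max_depth - 1)
            + refine(f, mid, b, tol / 2.0, max_depth - 1))


def steep_front(x):
    # Hypothetical test profile: a smooth step centred at x = 0.3.
    return math.tanh(50.0 * (x - 0.3))


cells = refine(steep_front, 0.0, 1.0, tol=1e-4)
widths = [hi - lo for lo, hi in cells]
```

In this sketch the narrow cells cluster around the steep front while the flat region keeps its coarse cells, and the final set of cells is known only after the estimates have been evaluated. That is the property that rules out compile-time partitioning.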

Recently published results support our hypothesis: in their investigation of
an adaptive, parallel, finite element system (FEARS), simulated on two different
multiprocessor architectures (Cm* and ZMOB, 256 Z-80's connected on a ring),
Zave and Cole noted that serial bottlenecks accounted for 80% of the total
execution time, and communication only 1%. Maximum speedups of only 3-5 were
measured. These results are not surprising: in FEARS no attempt is made to
partition newly generated subgrids (as in AMR, multi-level grid structures are
used for improved accuracy), so if some were much larger than the others, a
small number of processors would be doing most of the work, and not
communicating very often. The results of the FEARS study were not conclusive,
and we will test our hypothesis by extracting the distribution of subgrid sizes
from the Cray trace tapes.
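
The effect of unequal subgrid sizes on speedup can be sketched with a small calculation. The sketch assumes, for illustration only, that the work in a subgrid is proportional to its size, that a subgrid is never split across processors (as described for FEARS), and that subgrids are handed out largest-first to the least-loaded processor. The function name and the sizes are hypothetical.

```python
def estimated_speedup(subgrid_sizes, processors):
    """Speedup estimate when every subgrid runs whole on one processor."""
    loads = [0] * processors
    # Greedy assignment: largest subgrid first, to the lightest processor.
    for size in sorted(subgrid_sizes, reverse=True):
        lightest = loads.index(min(loads))
        loads[lightest] += size
    # Serial time is the total work; parallel time is the heaviest load.
    return sum(subgrid_sizes) / max(loads)


balanced = estimated_speedup([300, 300, 300, 300], processors=4)
skewed = estimated_speedup([1000, 50, 50, 50, 50], processors=4)
```

With these made-up sizes the balanced case reaches a speedup of 4.0 on four processors, while the skewed case is held to 1.2 by the single large subgrid, however many processors are added.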

Adaptive partitioning has not received much attention except in classic
dataflow. True, partitioning is trivial in this instance, and this has been cited
as a major reason for adopting a dataflow-style architecture. But owing to the
great cost of synchronizing low-level scalar operations (among other
difficulties), classic dataflow is unsuitable for applications like AMR; instead,
high-level macro operations have to be used (e.g., vector add). It is for this
reason that we believe that the problem of higher-level adaptive partitioning
cannot be avoided, even if a non-traditional architecture were to be used.

We envision using a data-driven model of execution for our machine design:
the nodes perform functions at a much higher level than classic dataflow (indeed
they will be traditional, Cray-1-like arithmetic and logical functional units),
arcs are bidirectional and have storage, and more complex firing rules are used
to admit the processing of data streams. In such a system it would be possible
to exploit both the advantages of dataflow (dynamic detection of concurrency)
and the advantages of control flow (efficiency of low-level operations such as
vector arithmetic).

We believe that adaptive numerical techniques will become more prevalent
in the future, owing to the emergence of commercial multiprocessors (e.g., the
BBN Butterfly Machine, the Denelcor HEP-1, and the Cray X-MP). Unless the
problem of adaptive partitioning is well understood, multiprocessors, whether
of a traditional design or not, will not be cost-effective for the novel
numerical methods.


1.4.  Multiprocessor  Circuit  Simulation  (D.G.  Messerschmitt) 

The project in which scheduling algorithms were developed for concurrent
execution of the Doolittle LU decomposition algorithm has been completed.
These scheduling algorithms have specific application to LU decomposition in
circuit simulation, and more generally to the concurrent execution of a program
on multiple processors where the communication delay between processors is
significant in comparison to the computation time. Two types of scheduling
heuristics were developed and compared: local algorithms and global algorithms.
In the local algorithms, Hu's level scheduling algorithm was modified to
attempt to find the mapping between ready tasks and available processors that
minimizes communication delay. In the global algorithms, the remaining longest
path was mapped onto a single processor in order to minimize the effect of
communication delay on this critical path. Significant speedup, often greater
than 50%, was obtained with these algorithms on the LU decomposition of
sparse matrices extracted from SPICE simulations of actual circuits. This work
has been reported in a thesis [1] and papers are in preparation.
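
As background for the local heuristics, the unmodified level scheduling rule can be sketched as follows. This is a minimal sketch under stated assumptions: every task takes one time unit, communication delay is ignored, and ties are broken by task name. The modification for communication delay described above is not reproduced, and the task graph and function names are hypothetical.

```python
def levels(succ):
    """Level of a task: number of tasks on its longest path to a sink."""
    memo = {}

    def level(task):
        if task not in memo:
            memo[task] = 1 + max((level(s) for s in succ[task]), default=0)
        return memo[task]

    for task in succ:
        level(task)
    return memo


def level_schedule(succ, processors):
    """Unit-time list scheduling that always runs the highest levels first."""
    lvl = levels(succ)
    pending = {task: 0 for task in succ}  # unfinished predecessor counts
    for task in succ:
        for s in succ[task]:
            pending[s] += 1
    ready = [task for task in succ if pending[task] == 0]
    schedule = []
    while ready:
        ready.sort(key=lambda task: (-lvl[task], task))
        step, ready = ready[:processors], ready[processors:]
        schedule.append(step)
        for task in step:
            for s in succ[task]:
                pending[s] -= 1
                if pending[s] == 0:
                    ready.append(s)
    return schedule


# Hypothetical task graph: a and b feed d, c feeds e, d and e feed f.
graph = {"a": ["d"], "b": ["d"], "c": ["e"], "d": ["f"], "e": ["f"], "f": []}
plan = level_schedule(graph, processors=2)
```

Here `plan` lists the tasks run in each time step. A communication-aware variant would also have to decide which processor each ready task goes to, which is the mapping problem the local algorithms address.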

As reported six months ago, a method of achieving arbitrarily high sampling
rates in IIR digital filtering with a fixed speed of hardware has been discovered.
This technique is reported in a recent thesis by Lu [2], and will be reported in a
paper in the near future. This technique has been extended to an arbitrary
filter (previously only simple poles were allowed). In addition, an approach
proposed by Rao and Kailath at Stanford, in which good numerical properties are
maintained by using an orthogonal state matrix, has been extended to allow high
sampling rates. Some preliminary work has been done on the partitioning of this
implementation into individual chips.

The project on automatic generation of interconnection topologies for
multiple processors in a specific applications domain is nearing completion.
Methods of measuring the performance of a given interconnection, which should
be simple yet related to program speedup, are being examined. A thesis
in this area should be completed within six months.

Finally,  work  has  started  in  the  design  of  algorithms  and  multiprocessor 
architectures  for  implementation  of  the  finite  element  method  widely  used  in 
seismic  and  structural  modelling. 


1.5. VLSI Multicomputers (C. Sequin)

The design study of a VLSI communications component has come to a
conclusion with the completion of the PhD thesis by Richard Fujimoto [1]. This
thesis discusses the design of VLSI components with about 100,000 devices that
would permit the construction of VLSI multicomputer networks which
communicate through dedicated links between nearest neighbors.

It is shown that, when considering a fixed maximum output bandwidth from
a single-chip component, it is preferable to have only a few ports with as much
bandwidth each as possible. The higher bandwidth per port almost always more
than compensates for the larger number of node-to-node hops in a network with
correspondingly lower branching. The tradeoffs among different routing and
buffering schemes are also analyzed. Many crucial results were obtained with
the help of SIMON [2], a simulator for such closely coupled multicomputer
systems running on a VAX under UNIX. Fujimoto, now a professor at
Utah, will continue to work in that area.

In addition to high performance, high reliability is another strong reason to
consider multicomputer systems. When executing large computation tasks,
such as those involved in weather forecasting or wind-tunnel simulation, the
system may have to operate continuously for many hours. Due to various
degenerative processes and to our inability to completely test VLSI chips, there
is a high probability that a component of the computing system will generate
incorrect results during the execution of large tasks. We are studying ways to
prevent such component failure from leading to system failure through the use
of fault-tolerance techniques. In particular, we are exploring the use of
techniques that involve a small performance penalty and do not require
significant increases in the complexity of the system.

A multicomputer is especially well suited for fault-tolerance techniques
since it is partitioned into independent and "intelligent" components (the
nodes). Fault-free components can adapt to changes in faulty components and
continue their operation in a way that leads to correct system output despite
the fault. Detecting errors immediately after they occur greatly simplifies the
error recovery procedures that must be invoked in order to restore the system
to a valid state. This is accomplished through the use of self-checking nodes
that signal an error to their neighbors when they produce incorrect results.
With VLSI, an effective way to implement a self-checking node is by using
duplicate functional modules whose outputs are continuously compared. Such a
duplication and matching scheme at the processor level [3] also has the
advantage of conceptual simplicity.
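
The duplication and matching idea can be sketched in a few lines. This is a software analogy under stated assumptions, not the hardware design from [3]: the two modules are plain functions, the comparator is assumed to be fault-free, and the class name and the injected fault are hypothetical.

```python
class SelfCheckingNode:
    """Runs two copies of a functional module and compares their outputs."""

    def __init__(self, module_a, module_b):
        self.module_a = module_a
        self.module_b = module_b

    def compute(self, x):
        """Return (result, error_flag); the flag is raised on a mismatch."""
        a = self.module_a(x)
        b = self.module_b(x)
        return a, a != b


def square(x):
    return x * x


def faulty_square(x):
    # Hypothetical defect: the least significant output bit is stuck at 1.
    return (x * x) | 1


node = SelfCheckingNode(square, faulty_square)
even_case = node.compute(4)   # 16 versus 17: the mismatch raises the flag
odd_case = node.compute(3)    # 9 versus 9: the fault is not excited
```

The second call shows that a fault is signalled only when an input excites it, which is one reason errors are best caught immediately at the node where they first appear.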

The critical circuit in this scheme is a comparator, which must not be
susceptible to faults that can remain undetected and later mask the failure of
the functional modules. Since physical defects can affect the comparator, it must
be self-testing so that it produces an error indication when it incurs such a
defect. Based on a new fault model for PLAs, we have shown that with both NMOS
and CMOS technologies a PLA can be used to implement such a comparator. The
design of such a comparator and its behavior under most conceivable faults that
are likely to occur in a VLSI system are analyzed in [4]. The analysis even for
this simple and regular component is rather difficult and tedious; because of
that, the duplication and matching scheme looks particularly attractive, since it
requires such a detailed analysis for only one component: the self-checking
comparator.


2. COMPUTER AIDS FOR DESIGN AND LAYOUT


2.1. 1983 VLSI Tools Distribution, 4.2 version (J. Ousterhout)

The  1983  tools  tape,  which  we  have  been  distributing  since  April  1983,  has 
been  upgraded  to  run  under  the  4.2  version  of  Berkeley  Unix  (the  initial  version 
of  the  tape  runs  only  under  version  4.1).  Except  for  the  switchover  to  4.2,  there 
are  no  major  changes  to  the  programs.  The  new  tape  has  been  available  since 
early  February  1984. 


2.2.  The  Magic  Layout  System  (J.  Ousterhout) 

Magic is a new VLSI layout system that has been under development for
about a year (until recently, it was called "Caddy"). In the last six months we
have completed the implementation of the multiple-window package and the
design-rule checker. The window package allows a number of overlapping
windows of different types to coexist on the color display. Different windows may
contain different views of the same circuit or views of different circuits.
Information can be copied from one window to another. Windows can also be used
for other functions such as menus, a glyph editor, or a color map editor.

Magic's design-rule checker is an incremental one that runs in background
to update error information as soon as possible after the circuit has changed.
Magic records areas that have been modified and remembers this information
until the areas have been re-checked, even if this doesn't happen until a later
editing session. For small changes, the re-check occurs instantaneously. For
large changes, such as moving a large cell so that it overlaps another large cell,
more time may be required. Early measurements indicate that the checker can
process about 800 tiles/second when working entirely within one cell, or about
200 tiles/second when registering information from overlapping subcells.
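
The bookkeeping behind an incremental checker of this kind can be sketched as a queue of modified areas that is drained a little at a time. This is only an illustration of the idea and makes no claim about Magic's actual data structures: the class, the rectangle format `(x0, y0, x1, y1)` and the toy minimum-width rule are all assumptions, and saving the queue between editing sessions is not modelled.

```python
from collections import deque


def min_width_check(rect, min_width=2):
    """Toy rule (an assumption): a rectangle narrower than min_width is an error."""
    x0, y0, x1, y1 = rect
    return ["too narrow"] if min(x1 - x0, y1 - y0) < min_width else []


class IncrementalChecker:
    def __init__(self, check_area):
        self.check_area = check_area  # callable: rect -> list of violations
        self.dirty = deque()          # areas modified but not yet re-checked
        self.errors = {}              # rect -> violations found there

    def mark_modified(self, rect):
        """Record that an area changed; the check itself happens later."""
        self.dirty.append(rect)

    def run_background(self, budget):
        """Re-check up to `budget` dirty areas; return how many remain."""
        while self.dirty and budget > 0:
            rect = self.dirty.popleft()
            self.errors[rect] = self.check_area(rect)
            budget -= 1
        return len(self.dirty)


checker = IncrementalChecker(min_width_check)
checker.mark_modified((0, 0, 10, 1))  # one unit tall: violates the toy rule
checker.mark_modified((0, 5, 10, 9))  # wide enough in both directions
remaining = checker.run_background(budget=1)
```

After the first call one area is still waiting, so an editor built this way can stay responsive and finish the re-check whenever it is idle.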

Compaction and stretching are provided with a new operation called
"plowing", which allows portions of the circuit to be re-arranged while
maintaining the design rules and connectivity. The plowing design was completed
late in 1983. A straw-man implementation was developed in the summer and early
fall of 1983 to test out the basic ideas. The straw-man used simplified design
rules and operated in only a single direction. In late 1983 we began
implementation of a new version to handle real design rules and hierarchical
designs. That implementation has just recently become operational, but it still
works in only a single direction.


2.3.  More  Accurate  Timing  Models  for  Crystal  (J.  Ousterhout) 

The new slope-based timing models for the Crystal timing analyzer are now
fully implemented and in regular use by designers. Their accuracy was
benchmarked against the SPICE circuit simulator, using a dozen critical paths
from the RISC II processor and cache chips. On average, Crystal's delay estimates
were within 5-10% of SPICE's estimates, even though Crystal's simplified delay
calculation algorithm is approximately 10,000 times faster than SPICE's.


2.4. Finite-State Machine Synthesis (R. Newton and A. Sangiovanni-Vincentelli)

Our research in this period has concentrated on four major issues:
finite-state machine synthesis, relaxation-based circuit simulation,
special-purpose architectures for the solution of large scale systems, and
simulated annealing techniques for placement and routing of standard cells,
gate-arrays and macro-cells.

Sequential circuits play a major role in the control part of digital systems.
We addressed the automated synthesis of sequential logic functions in a
structured VLSI design methodology. We considered sequential logic functions
implemented by synchronous deterministic Finite State Machines (FSMs) consisting
of two distinct components: a combinational circuit implemented in two levels of
logic, using a Programmable Logic Array (PLA) or constrained gate-matrix style,
and a memory implemented by Delay-type registers, such as a set of C²MOS
latches.

In particular, we considered the problem of assigning binary codes to the
internal states of a Finite State Machine. In our past research, we proposed
the generation of adjacency rules for the encoding of states, based on
heuristic techniques developed by extracting a few operations used by logic
minimizers to reduce the number of product terms of a PLA realization of
FSMs [1]. We also proposed new algorithms for the solution of the graph
embedding problem arising from the adjacency constraints determined by the
heuristic rules. Unfortunately, most logic minimizers that have been developed
recently are very complex programs based on the use of sophisticated
techniques. Hence, the heuristic rules are only able to capture a small part of
the operations of logic minimizers such as MINI, PRESTO, or POP, the logic
minimizer we developed some time ago.

Recently, in collaboration with Dr. Brayton of the IBM T.J. Watson Research
Center, we have arrived at a completely new approach to the state
encoding problem. The new approach is based on the observation that the state
table representation of an FSM can be thought of as the personality matrix of a
PLA where the present states and the next states are considered as symbolic or
multiple-valued variables. Then, by using a symbolic or multiple-valued logic
minimizer, we can minimize the number of "product terms" (rows) in the state
table representation of the FSM. This minimization provides a set of guidelines
on how to encode states such that the minimization obtained by the symbolic
minimizer can be achieved after the states have been encoded. In some sense,
we have turned the problem on its side: minimization and state encoding are no
longer two separate steps.

The  "  rules  "  obtained  by  applying  the  multiple-value  logic  minimizer  tell 
us  that  a  set  of  states  should  be  assigned  adjacent  codes,  i.e.  they  should  be 
placed  on  the  same  hyper  face  of  the  Boolean  hypercube  representing  the 
number  of  bits  used  in  the  encoding.  Note  that  this  approach  defines  a  new 
combinatorial  optimization  problem  which  we  conjecture  to  be  NP-complete.  We 
devised  a  heuristic  algorithm  with  some  guaranteed  properties  which  assigns 
codes  to  states  so  that  all  the  rules  are  satisfied.  The  dimension  of  the  Boolean 
space  may  be  larger  than  the  minimum  number  of  bits  needed  to  encode  all  the 
states of the FSM, i.e., if n is the number of states, the ceiling of log2(n). An
interesting  open  problem  which  we  are  investigating  is  the  trade-off  between  the 
number  of  bits  used  in  the  encoding  and  the  number  of  product  terms  after  the 
encoding  has  been  done.  Note  that  to  satisfy  all  the  adjacency  rules  determined 
by the symbolic minimization we may need to use a large number of bits. However,
such a large number of bits may be needed because of only one or two rules. If
we decide not to satisfy these rules, the number of product terms may be larger
by two or three, but the number of bits used may be smaller, thus reducing the
overall area of the implemented PLA. The results of this research will be
reported in a paper to be submitted to ICCAD 1984.
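
The adjacency rules can be made concrete with a small check. The following is a minimal sketch, not the heuristic algorithm mentioned above: it assumes that a rule is satisfied when the smallest hypercube face containing the codes of a group of states contains the code of no other state. The function names and the dictionary-based encoding are illustrative assumptions.

```python
def min_code_bits(num_states):
    """Minimum number of bits for distinct codes: the ceiling of log2(n)."""
    return max(1, (num_states - 1).bit_length())


def spanned_face(codes, nbits):
    """Smallest hypercube face containing all the given codes.

    Returned as (mask, value): bit positions set in mask are fixed to the
    corresponding bits of value; all other positions are free.
    """
    first = codes[0]
    differing = 0
    for code in codes[1:]:
        differing |= first ^ code
    mask = ~differing & ((1 << nbits) - 1)
    return mask, first & mask


def satisfies_rule(encoding, group, nbits):
    """True if no state outside `group` has a code on the face spanned by it."""
    mask, value = spanned_face([encoding[s] for s in group], nbits)
    return all((code & mask) != value
               for state, code in encoding.items() if state not in group)


# Hypothetical 4-state machine encoded with the minimum number of bits.
encoding = {"a": 0b00, "b": 0b01, "c": 0b11, "d": 0b10}
print(min_code_bits(len(encoding)))                # number of code bits
print(satisfies_rule(encoding, {"a", "b"}, 2))     # a and b share face 0-
print(satisfies_rule(encoding, {"a", "c"}, 2))     # a and c span the whole square
```

A group whose rule cannot be satisfied at the minimum code length is exactly the situation in which extra bits, or a few extra product terms, have to be traded against each other.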

In connection with this research, we have developed a new set of algorithms
for logic minimization. These algorithms, devised in collaboration with Dr.
Brayton of IBM, Prof. Hachtel of the University of Colorado, Boulder, and C.
McMullen of Harvard University, represent a significant improvement over other
available techniques. This work, carried out under IBM sponsorship, has
resulted in an APL program and in a C program, ESPRESSO II, which is
considerably faster than MINI and always generates results as good as or better
than those of MINI, PRESTO, or POP in the final implementation of the logic. In
particular, we have been able to obtain a result that shows how to use a
two-level minimizer to minimize multiple-valued logic, thus making ESPRESSO II,
born as a two-level logic minimizer, an effective tool for FSM synthesis. We
are developing a specialized set of algorithms for the multiple-valued logic
minimization problem to speed up even further the symbolic minimization step
of the FSM synthesis procedure. We are presently writing a monograph on the
algorithms for ESPRESSO II, which should appear at the end of the summer.


2.5. Relaxation-based Circuit Simulation (A. Sangiovanni-Vincentelli, R. Newton)

Over  the  past  six  years,  a  new  class  of  algorithms,  called  Relaxation-Based 
Methods,  has  been  applied  to  the  electrical  IC  simulation  problem.  We  have 
developed  a  number  of  simulators  (RELAX  and  RELAX2,  SPLICE1.6  and  SPLICE2)  that 
use different forms of these methods to provide waveforms as accurate as, or
more accurate than, those of standard circuit simulators such as SPICE2 or
ASTAP, with up to two orders of magnitude speed improvement for large circuits.
These simulators have been used for the analysis of both digital and analog MOS
ICs, and more recently for the analysis of Bipolar circuits. They use
relaxation methods for the solution of the set of ordinary differential
equations (ODEs) which describe the circuit under analysis, rather than the
direct, sparse-matrix methods on which standard circuit simulators are based.

During this period, we studied the numerical properties of the various
methods for the analysis of MOS circuits and presented them in a rigorous and
unified framework in [2]. We also improved our relaxation algorithms and their
implementation [3-5].

Recent results with the ITA algorithm are presented in [3]. In particular, we
have used the program to analyze a number of large, industrial circuits with
results as accurate as SPICE. The SPLICE1.7 program is now at approximately
100 sites and we are working actively with five of those sites.

In [4], we describe RELAX2.1, a new, improved version of RELAX, a
Waveform-Relaxation-based simulator, and the new algorithms implemented in
the program. In particular, we were able to characterize the convergence
behavior of the Waveform Relaxation Method on a class of circuits which
required a large number of iterations to converge. The study of the convergence
behavior has led to the concept of "windowing", i.e., of breaking up the time
interval over which the analysis has to be performed into sub-intervals so that
the algorithm applied in these sub-intervals exhibits fast convergence. In
addition, techniques for the automatic partitioning of the circuit into
subcircuits have been included in RELAX2, developed with the sponsorship of a
grant from MICRO, avoiding the tedious operation of manually entering the
decomposition of the circuit into subcircuits. The decomposition into
subcircuits can speed up any relaxation algorithm and we are planning to extend
this technique to the SPLICE2 program.
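
As an illustration of waveform relaxation with windowing, the sketch below applies Gauss-Seidel waveform relaxation to a hypothetical two-node linear system (dx1/dt = -2*x1 + x2 + 1, dx2/dt = x1 - 2*x2), using backward Euler inside each window. The test system, step size and function names are assumptions made for the example; this is not the RELAX2 implementation. A direct (coupled) backward Euler solver is included only as a reference.

```python
def relax_window(x1_start, x2_start, steps, h, tol=1e-10, max_sweeps=500):
    """Gauss-Seidel waveform relaxation over one window of `steps` time points.

    Each sweep integrates node 1 over the whole window with node 2's waveform
    frozen, then node 2 using the fresh node 1 waveform (backward Euler).
    Returns (waveform1, waveform2, sweeps_used).
    """
    w1 = [x1_start] * (steps + 1)  # initial guess: constant waveforms
    w2 = [x2_start] * (steps + 1)
    for sweep in range(1, max_sweeps + 1):
        old1, old2 = w1[:], w2[:]
        for k in range(steps):
            # dx1/dt = -2*x1 + x2 + 1, with x2 taken from the previous sweep
            w1[k + 1] = (w1[k] + h * (w2[k + 1] + 1.0)) / (1.0 + 2.0 * h)
        for k in range(steps):
            # dx2/dt = x1 - 2*x2, with x1 taken from this sweep
            w2[k + 1] = (w2[k] + h * w1[k + 1]) / (1.0 + 2.0 * h)
        change = max(abs(a - b) for a, b in zip(w1 + w2, old1 + old2))
        if change < tol:
            return w1, w2, sweep
    return w1, w2, max_sweeps


def simulate_windowed(total_steps, h, num_windows):
    """Split the interval into windows and relax each one to convergence."""
    x1, x2, sweeps = 0.0, 0.0, 0
    per_window = total_steps // num_windows
    for _ in range(num_windows):
        w1, w2, used = relax_window(x1, x2, per_window, h)
        x1, x2 = w1[-1], w2[-1]  # end of this window starts the next one
        sweeps += used
    return x1, x2, sweeps


def simulate_direct(total_steps, h):
    """Reference: backward Euler on the coupled 2x2 system, solved directly."""
    x1 = x2 = 0.0
    d = 1.0 + 2.0 * h
    det = d * d - h * h
    for _ in range(total_steps):
        b1, b2 = x1 + h, x2
        x1, x2 = (d * b1 + h * b2) / det, (h * b1 + d * b2) / det
    return x1, x2


print(simulate_windowed(400, 0.01, 1))
print(simulate_windowed(400, 0.01, 4))
print(simulate_direct(400, 0.01))
```

The third value returned by `simulate_windowed` is the total number of sweeps, which can be compared across window counts to see how the choice of window length affects convergence on a given circuit.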

SPLICE2 has been developed as an experimental "framework" for exploring
both relaxation-based electrical simulation and mixed-level simulation.
The program performs ITA-based relaxation electrical simulation, as well as
iterated, relaxation-based switch and logic simulation. New results in the area
of convergence criteria, including "waveform convergence", have been
developed, as well as new cache-based event scheduling algorithms that are
well-suited to the wide range of circuit time-constants present in a mixed-level
environment [5]. Present work involves extensions to Register Transfer Level, a
new form of analysis called ELogic (Electrical-Logic), and the inclusion of path
analysis code for tagging critical circuit paths to be simulated in detail.


2.6. Special Purpose Architectures for the Solution of Large Scale Systems (A.
Sangiovanni-Vincentelli, R. Newton)

The solution of Large-scale Systems of Equations (LSE), both algebraic and
differential, is needed in the analysis and simulation of many engineering
systems.

New architectures, in particular vector computers such as the CRAY 1, have
inspired the design of new algorithms to exploit parallelism in the solution
process. An important example is the program CLASSIE for the simulation of
electronic circuits. Along these lines, peripheral array processors, such as the
FPS164, can also be used in conjunction with hosts such as the VAX 11/780 to
speed up the solution process. However, this speedup is not enough to cope with
the problems to be solved in the VLSI era.

The advent of VLSI technology has made the cost-effective design of special
purpose machines possible. Examples of these machines are the Yorktown
Simulation Engine (YSE) for logic simulation and Systolic Arrays.
Special-purpose machines have also been proposed for the solution of linear,
algebraic LSEs. Most of these machines limit the size of the operand matrix.
When no size limit is imposed, the operand matrix has to be partitioned into
submatrices of equal sizes. Only Johnsson and Pottle treated the related
numerical properties. However, special matrix structures, such as the Bordered
Block Diagonal Form (BBDF) or the Bordered Block Triangular Form (BBTF), commonly
encountered in engineering problems, are not exploited in much of this work. In
[6], we proposed a new algorithm-architecture, BLOSSOM, for the solution of LSE.

This architecture supports other matrix operations used as subprocedures
by block LU decomposition, such as the multiplication and the inversion of
submatrices. We described the hardware implementation of these matrix
operations. We are simulating the performance of the proposed architecture and
comparing its speed and power with other architectures such as data-flow
machines. In addition, we are studying the combinatorial optimization problem
arising from the optimal partitioning of the sparse matrix. We are in the process
of developing new algorithms for determining the optimal partition. We are also
investigating numerical techniques to guarantee the numerical stability
of the scheme used in the special purpose hardware, a unique feature of
BLOSSOM.
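
To show why the Bordered Block Diagonal Form lends itself to parallel hardware, the sketch below solves a small BBDF system by block elimination: each diagonal block is processed independently, and only a small Schur complement system couples them. It is a generic textbook block elimination written with exact fractions, not the BLOSSOM algorithm; the helper names and the test matrices are assumptions.

```python
from fractions import Fraction


def solve(A, B):
    """Solve A X = B (B given as a list of rows) by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(v) for v in B[i]]
         for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def matsub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def solve_bbdf(blocks, D, g):
    """Solve  A_i x_i + B_i y = f_i  (for every block i),
              sum_i C_i x_i + D y = g.

    blocks is a list of (A_i, B_i, C_i, f_i); vectors are column matrices.
    """
    S, r, partial = D, g, []
    for A, B, C, f in blocks:          # each block could run on its own unit
        AinvB, Ainvf = solve(A, B), solve(A, f)
        S = matsub(S, matmul(C, AinvB))    # Schur complement of the border
        r = matsub(r, matmul(C, Ainvf))
        partial.append((AinvB, Ainvf))
    y = solve(S, r)                    # small coupling system
    xs = [matsub(Ainvf, matmul(AinvB, y)) for AinvB, Ainvf in partial]
    return xs, y


# Hypothetical system with a 2x2 block, a 1x1 block and a 1x1 border.
blocks = [
    ([[2, 1], [1, 3]], [[1], [0]], [[0, 1]], [[1], [2]]),
    ([[4]], [[1]], [[2]], [[3]]),
]
D, g = [[5]], [[4]]
xs, y = solve_bbdf(blocks, D, g)
print(xs, y)
```

The loop over blocks contains no dependence between iterations, which is the property a special-purpose machine can exploit; only the final small system for the border variables is sequential.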


We have been investigating the use of data-driven computer architectures
for circuit simulation since 1981. Early in 1982, we concluded that
relaxation-based algorithms are well-suited to the use of data-driven
multiprocessors for the solution process. Using the FTL2 program [7], we
simulated a variety of computer interconnection networks while running a
distributed version of the SPLICE ITA algorithm. Based on the results of those
simulations, we decided that for our problem, the multi-stage perfect shuffle
network, or OMEGA network, was the best choice. Unfortunately, we could not
analyze large circuits on our simulated machine. We have since used the BBN
Butterfly computer, developed at BBN for DARPA, as a vehicle for tuning our
algorithms, analyzing their properties, and evaluating performance limits of
the interconnection network. The Butterfly consists of a number of MC68000
processors, AMD2901-based processors, and memory, connected using an OMEGA net.
The machine we used had 10 processing nodes but machines with up to 128
processors are planned at BBN.

Our results for the practical runs supported our earlier analysis. For the
analysis of an industrial circuit containing over 700 MOSFETs we achieved 70%
efficiency on the 10-processor machine. This compares well with the maximum
10-15% efficiency we have been able to achieve using direct analysis techniques
on vector processors such as the CRAY 1. We are presently awaiting our own
16-processor Butterfly to continue this research and have developed a
floating-point support board for the machine. We also plan to add additional
memory to each node. We estimate that with 1-MIPS processors at each node, with
256 nodes in such a machine, we would be able to achieve a routine 1000-fold
performance improvement over SPICE2 on a single processor for the analysis of
large circuits. The initial results of this research will be reported in [8].
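
For readers who want the efficiency figure as a speedup, the short sketch below assumes the usual definition of parallel efficiency (speedup divided by the number of processors). That definition is an assumption; the report does not state how efficiency was measured.

```python
def speedup_from_efficiency(efficiency, processors):
    """Speedup over one processor, assuming efficiency = speedup / processors."""
    return efficiency * processors


def efficiency_from_speedup(speedup, processors):
    """Inverse relation under the same assumed definition."""
    return speedup / processors


# 70% efficiency on the 10-processor machine quoted above.
print(speedup_from_efficiency(0.70, 10))
```

Under this definition, 70% efficiency on 10 processors corresponds to running about seven times faster than the same algorithm on one node.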


2.7.  Simulated  Annealing  Algorithms  for  Placement  and  Routing  (A. 
Sangiovanni-Vincentelli,  R.  Newton) 

Simulated Annealing is a relatively new technique proposed by Kirkpatrick,
Gelatt and Vecchi of the IBM T.J. Watson Research Center for the solution of
complex combinatorial optimization problems. This technique has been
introduced by establishing an analogy between combinatorial optimization and the
annealing process. Algorithms based on this technique have been developed for
the placement and routing problem. We believe that this technique has great
potential for the solution of the placement problem for IC design. In particular,
its simplicity and flexibility, together with its ability to escape local
optima (getting stuck at a local optimum is a common drawback of the heuristics
used in layout problems), have attracted the attention of a number of
researchers in universities as well as in industry. We have developed a set of
algorithms and corresponding packages for the placement of gate-arrays,
standard cells and macro cells. We have recently tested the results obtained by
the algorithms on a set of complex industrial circuits. In all cases, a
significant reduction of chip area over the area obtained by other optimization
techniques and the one obtained by hand-layout has been achieved [5]. The only
drawback of this technique is long running time. In the largest example tried
thus far, a 1,500 cell layout, a total reduction of chip area of the order of
37% has been obtained at the expense of 24 hours of computing time on a
VAX 11/780 running VMS.

To the best of our knowledge, there is no theory available today to justify
the good performance of the algorithm besides the physical analogy mentioned
above. We are presently trying to apply Markov chain theory to prove that in
an ideal situation the algorithm produces the global optimum with probability
one. The results of the theoretical investigation should provide the necessary
foundations of the method in addition to new vistas on methods to reduce the
computing time.
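
The basic annealing loop can be sketched on a toy placement problem. The example below places cells in a single row and minimizes the total span of two-pin nets, accepting uphill moves with the Metropolis probability exp(-delta/T). The cost function, cooling schedule and parameter values are illustrative assumptions and are much simpler than those of the placement packages described above.

```python
import math
import random


def wirelength(order, nets):
    """Sum over nets of the distance between leftmost and rightmost pin."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(max(pos[c] for c in net) - min(pos[c] for c in net)
               for net in nets)


def anneal(initial_order, nets, t_start=10.0, t_end=0.01, alpha=0.95,
           moves_per_temp=200, seed=0):
    """Simulated annealing over pairwise swaps; returns (best_order, best_cost)."""
    rng = random.Random(seed)
    order = list(initial_order)
    cost = wirelength(order, nets)
    best, best_cost = order[:], cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            i, j = rng.sample(range(len(order)), 2)
            order[i], order[j] = order[j], order[i]
            delta = wirelength(order, nets) - cost
            # Always accept improvements; accept uphill moves with
            # probability exp(-delta / t) so the search can leave local optima.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost += delta
                if cost < best_cost:
                    best, best_cost = order[:], cost
            else:
                order[i], order[j] = order[j], order[i]  # undo the swap
        t *= alpha  # geometric cooling
    return best, best_cost


# Hypothetical example: eight cells connected in a chain, scrambled at start.
nets = [(i, i + 1) for i in range(7)]
start = [3, 7, 0, 5, 2, 6, 1, 4]
best, best_cost = anneal(start, nets)
print(wirelength(start, nets), best_cost, best)
```

The long running times reported above come from the same structure: many trial moves per temperature and a slow cooling schedule, each move requiring a cost evaluation.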


2.8. Tradeoffs in Wire Representation for VLSI Layouts (C.H. Sequin)

Most VLSI chips are, and will be even more so in the future, dominated by
their interconnections. Correspondingly, the wiring task dominated the actual
layout tasks in the RISC and SOAR chips. Effective tools for interactive or
automatic wiring of VLSI chips are thus the most important missing link in our
UNIX design environment. The basis of any tool that has to deal with
interconnections is a suitable underlying semantic model and appropriate data
structures for wires.

We have studied tile-based wire representations in the context of the space
partitioning by corner-stitched tiles introduced by Ousterhout [1]. The
representations studied can be divided into three broad classes:

Skeletal  representations  model  a  wire  as  a  chain  of  connected  line  segments 
with  attached  attributes  such  as  wire  width  and  type  of  material. 

Purely  physical  representations  model  the  physical  space  occupied  by  wire 
material  without  considering  the  segment  structure  of  the  wire  and  without 
explicitly  representing  connectivity. 

Directed  box  representations  attempt  to  combine  the  best  features  of  skeletal 
and  purely  physical  representations  by  tiling  the  physical  space  occupied  by  the 
wire  in  ways  that  preserve  a  simple  mapping  between  the  tiles  and  the  individual 
segments  of  the  wire. 

All  representations  have  some  pros  and  cons.  Skeletal  representations 
cleanly  express  connectivity  but  require  more  effort  to  identify  potential 
interactions  between  objects.  Purely  physical  representations  fit  most  naturally 
into  the  framework  of  corner-stitched  tiles  but  are  less  suitable  for  expressing 
and  manipulating  the  underlying  segment  structure  of  the  wire.  The  directed 
box  representations  turned  out  to  be  more  limited  than  expected  in  their  ability 
to  express  the  ways  in  which  wires  connect  to  each  other,  to  terminals,  and 
through  contacts  to  other  levels. 
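
The difference between the classes can be illustrated with a few lines of code. The sketch below stores a wire skeletally (centerline points plus width and layer), derives one rectangle per segment, and then flattens the rectangles into the set of covered unit cells, which is all that a purely physical view retains. The class and function names are hypothetical, and the one-box-per-segment mapping only loosely follows the spirit of the directed box idea; it is not the data structure used in the study.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Point = Tuple[int, int]
Rect = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


@dataclass
class SkeletalWire:
    points: List[Point]  # centerline vertices; consecutive points axis-aligned
    width: int           # wire width (assumed even, in layout units)
    layer: str           # type of material


def segment_boxes(wire: SkeletalWire) -> List[Rect]:
    """One rectangle per centerline segment, so boxes map back to segments."""
    half = wire.width // 2
    boxes = []
    for (x0, y0), (x1, y1) in zip(wire.points, wire.points[1:]):
        boxes.append((min(x0, x1) - half, min(y0, y1) - half,
                      max(x0, x1) + half, max(y0, y1) + half))
    return boxes


def covered_cells(boxes: List[Rect]) -> Set[Point]:
    """Purely physical view: the unit cells occupied by wire material."""
    cells = set()
    for x0, y0, x1, y1 in boxes:
        for x in range(x0, x1):
            for y in range(y0, y1):
                cells.add((x, y))
    return cells


wire = SkeletalWire(points=[(0, 0), (10, 0), (10, 6)], width=2, layer="metal")
boxes = segment_boxes(wire)
print(boxes)
print(len(covered_cells(boxes)))
```

The skeletal form makes connectivity trivial to read off but requires the boxes to be generated before overlaps can be checked; the flattened cell set answers overlap queries directly but no longer records which segment produced each cell.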

An  experimental  program  for  incremental  wire  manipulation,  WICRD  [2],  has 
been  built  to  support  this  analysis  and  to  study  tradeoffs  in  the  human  interface 
to  an  IC  routing  tool.  It  explored  in  detail  both  the  advantages  and  the  ultimate 
limitations  of  the  directed  box  wire  representation  used  in  WICRD.  We  have 
integrated a simple Lee maze router in WICRD and examined the routing problems
that arise when wires are of varying width and when their width can exceed the
spacing of the wire-placement grid typically used for this algorithm. The first
approach  explored  moved  a  larger  square  of  wiring  material  through  the  layout 
while  checking  for  violations  of  any  constraints.  An  analysis  of  the  space  and 
time  requirements  of  this  algorithm  then  favored  a  second  approach,  in  which 
the  given  (obstacle)  geometry  is  suitably  enlarged  to  reduce  the  problem  of 
finding  a  routable  path  to  the  conventional  problem  for  a  minimum  width  wire. 
The  WICRD  system  has  also  demonstrated  that  an  incremental  wire  movement 
algorithm  can  be  used  effectively  for  global  compaction. 
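
The second approach, enlarging the obstacles so that a wide wire can be routed as if it had minimum width, can be sketched as follows. The grid model, the square enlargement by half the wire width, and the function names are assumptions for illustration; spacing rules and multiple layers are ignored, and this is not the WICRD code.

```python
from collections import deque


def enlarge(obstacles, margin):
    """Grow every obstacle cell by `margin` cells in all directions."""
    grown = set()
    for x, y in obstacles:
        for dx in range(-margin, margin + 1):
            for dy in range(-margin, margin + 1):
                grown.add((x + dx, y + dy))
    return grown


def lee_route(width, height, obstacles, start, goal, border=0):
    """Lee-style breadth-first wave expansion; returns a shortest cell path."""
    if start in obstacles or goal in obstacles:
        return None
    previous = {start: None}
    frontier = deque([start])
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            path = []
            while cell is not None:   # trace back from goal to start
                path.append(cell)
                cell = previous[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = (border <= nxt[0] < width - border
                      and border <= nxt[1] < height - border)
            if inside and nxt not in obstacles and nxt not in previous:
                previous[nxt] = cell
                frontier.append(nxt)
    return None


def route_wide_wire(width, height, obstacles, start, goal, wire_width):
    """Route the centerline of a wire `wire_width` cells wide (odd width)."""
    margin = (wire_width - 1) // 2
    return lee_route(width, height, enlarge(obstacles, margin),
                     start, goal, border=margin)


# Hypothetical layout: a wall at x = 6 with a three-cell gap at y = 2..4.
wall = {(6, 0), (6, 1), (6, 5), (6, 6)}
for w in (1, 3, 5):
    print(w, route_wide_wire(13, 7, wall, (2, 3), (10, 3), w))
```

After enlargement the router itself is unchanged, which is why this approach compares favorably in space and time with moving a large square of wiring material through the layout.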


3. CIRCUIT & SYSTEM DESIGN


3.1. Design Frames and System Kits (R.H. Katz)

A design frame is the hardware analog of a software operating system. It
provides the run-time support needed to deliver system-level services to a
user-designed integrated circuit. It consists of a standard pad frame, interface
circuitry, and a printed circuit board to accept the chip footprint. We have been
experimenting with a design frame for the MultiBus system bus.

A  number  of  projects  from  the  Fall  offering  of  the  Mead  and  Conway  design 
course  were  designed  for  inclusion  within  the  MultiBus  design  frame,  and  were 
submitted  to  MOSIS  for  fabrication  in  January.  These  include  several  simple 
implementations  of  the  same  16-bit  microprocessor  architecture  and  a  test 
configuration chip. A MultiBus design frame board has been designed and
fabricated by the MOSIS service. At this time, we still await the return of the
course project chips.

With  the  arrival  of  the  printed  circuit  boards,  we  have  just  begun  serious 
testing  with  a  previously  fabricated  test  circuit  within  the  frame.  We  hope  to  be 
able  to  report  on  our  successful  integration  of  the  frame  chip  and  board  within 
the  SUN  workstation  by  the  time  of  the  DARPA  meeting. 

A higher performance version of the design frame, which corrects a number of the
performance problems encountered during last semester's course, is currently
being designed by a student (Gaetano Borriello). In addition, several projects
are underway to build real systems within the design frame context. One student
(Frankie Leung) is building a broadside integrated circuit tester controller
around the design frame. A controller within the chip level frame will (1)
download on-board test vector memory as a DMA device on the MultiBus, (2) cycle
through the test vectors, and (3) upload the test results. Another student
(Rick Brown) is building a scan logic controller as an extension of the
configuration chip he designed last semester. This chip will provide a flexible
means to read and write a scan path chained through a number of custom
designed parts. The design frame is being extended with a scan logic subsystem
for easier interfacing with the scan logic control chip.

A paper is currently in preparation describing the design frame concept,
the course results, and future directions. The paper is being written in
collaboration with Gaetano Borriello, Alan Bell of Xerox PARC, and Lynn Conway
of DARPA.


3.2. A 1000 Word Speech Recognition System (R.W. Brodersen)

Circuitry contained on one MultiBus card has been developed which will be
able to recognize 1000 words in real time. On the card are two custom ICs, an
Intel 80186 microcomputer, and memory. This card has the capability of 116
MIPS of von Neumann equivalent instructions, thus demonstrating the power of
dedicated parallel, pipelined processing.

One of the ICs is a filterbank chip which was generated fully automatically
from software. This chip implements a 16-channel filterbank with 112 poles of
filtering, at an initial sample rate of 14 kHz.

The second chip is an enhanced version of a previous design, which performs a
dynamic  programming  algorithm.  The  new  chip  has  more  parallelism  as  well  as  a 
number  of  glue  logic  functions  for  memory  control  which  were  required  to  be 
able  to  fit  the  entire  recognition  system  on  one  board. 
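
The report does not say which dynamic programming algorithm the chip implements. A common choice for isolated-word recognition is dynamic time warping (DTW) against stored templates, and the sketch below assumes that form. It uses scalar features and absolute differences for brevity; a real recognizer would compare multi-channel filterbank vectors, and the template data here is invented for the example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = cost of the best alignment of a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            D[i][j] = local + min(D[i - 1][j],      # stretch template frame
                                  D[i][j - 1],      # stretch input frame
                                  D[i - 1][j - 1])  # advance both
    return D[n][m]


def recognize(utterance, templates):
    """Return the vocabulary word whose template is closest under DTW."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical two-word vocabulary with one template per word.
templates = {"yes": [1, 3, 5, 3, 1], "no": [5, 4, 3, 2, 1]}
utterance = [1, 3, 3, 5, 3, 1, 1]   # a slower rendition of "yes"
print(recognize(utterance, templates))
```

Each table entry depends only on three neighbors, so the inner computation is regular and maps naturally onto parallel, pipelined hardware, which is consistent with the emphasis on dedicated processing above.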

The 80186 performs the MultiBus interface and will be used in future research
to implement such things as syntax direction, continuous speech algorithms and
sophisticated training (learning) algorithms. The flexibility of this board makes
it an ideal candidate as a base for future speech recognition research.

We have put this card into a SUN (UNIX based) workstation, and plan to use
it to incorporate speech into a number of applications. The UNIX drivers which
allow direct interaction between the SUN cpu and the 80186 cpu on the board
have been written. Software is now being developed which will allow
straightforward high level interaction between application programs written on
the SUN and the recognition board. Features such as adaptive training
algorithms, syntax direction and multiple vocabularies will be supported.

Discussions are underway with several companies to produce and support
the distribution of the board. In addition, SRI has generated a wirelist for the
board to allow duplication of the board wiring by an automatic wire-wrap
machine. It is expected that 10-20 boards will be made in this way for internal
distribution at Berkeley and SRI.


3.3. Real-time On-line Handwriting Recognition (R.W. Brodersen)

This project involves the design of an on-line handwriting recognition
system to handle more than 500 custom symbols in real-time. The system consists
of four major blocks: feature extraction, training, prematching, and
postmatching. The algorithms for feature extraction, prematching and
postmatching have been developed. Recent progress has been in the following
areas:

[1]  The  algorithms  have  been  successfully  implemented  using  the  speech 
recognition  hardware.  Both  accuracy  and  response  time  are  satisfactory. 
We  are  now  confident  that  handwriting  recognition  can  be  integrated  with 
speech  recognition  in  the  same  hardware  without  significant  degradation  in 
the  character  recognition  accuracy. 

[2] The prematching algorithm has been modified so that it takes more
advantage of the real-time processing capability of the speech processor
hardware.

[3]  A  clustering  technique  has  been  incorporated  in  the  training,  which  allows  a 
significant  decrease  in  templates  without  a  decrease  in  accuracy. 

Future work will involve implementing the algorithms in the new
speech recognition system and writing application programs which make use of
the new form of interaction.


3.4. Computer Generation of Digital Filter Banks (R.W. Brodersen)

A software/hardware system has been developed to automatically generate
digital filter bank circuits from high level filter descriptions. This system
has now been fully documented [1,2], and discussions are underway with several
companies to give them access to the programs.

The software consists of two main programs: the filter compiler, which
converts the filter descriptions to hardware descriptions (micro code, RAM size,
etc.), and the layout generator, which converts the hardware descriptions to
layout (mask) descriptions. To check the algorithms before the circuit is
fabricated, a test set is used which runs the micro code on a data path
identical to that included in the final circuit.

The major difference between this system and other automated design
efforts is that this software and hardware were optimized for one specific group
of DSP applications, digital filter banks. This resulted in a software design
time of only one month.

A number of circuits have been generated, fabricated and tested: a small
single bandpass filter chip, an 18-channel spectrum analyzer that is being used
in a speech recognition system, and an 18-channel spectrum analyzer for
consumer stereo. All circuits were functional the first time (the tester caught
all errors), with a very short time required for filter specification and
testing (approximately 1 day).


3.5. Macrocells for Signal Processing (R.W. Brodersen)

The primary area of investigation is research in architectures and VLSI
circuits suitable for signal processing applications.

A single-chip linear-predictive coding (L.P.C.) vocoder circuit has been
designed and tested [1]. This is a device used for digitally encoding speech at a
low bit-rate. Test results have shown that the device correctly performs the
L.P.C. algorithm and functions reliably as a real-time, full duplex vocoder.
Arrangements have been made with an industrial company experienced in
speech coding to perform critical evaluations of the speech quality with the
device configured into an LPC-10 coding system.


Another  circuit  which  has  recently  been  completed  is  a  speech  synthesizer 
which  implements  a  recently  developed  multirate  root  LPC  algorithm  [2].  This 
algorithm uses a closed form analysis and a formant-like structure. By using
quadratic coefficients and section repeat, its data rate can be lower than 1 kbps.
The  quality  can  be  continuously  improved  (with  increasing  bit  rate)  by  including 
a  representative  residual. 

Further work is being done in the area of signal processing I.C. design.
Appropriate architectures and computer-assisted design techniques are being
investigated. The use of multiprocessor architectures and macrocell-based
circuit design techniques is being actively studied. Families of signal
processing applications, such as speech processing and telecommunications, have
been identified as suitable for implementation using these techniques. The goal
of this research is to develop the tools and resources needed for rapid design of
semi-custom I.C.s for these applications.


high for both As+ implanted and unimplanted specimens, indicating that Ta is
inadequate as a diffusion barrier at 500°C for Al-Si contacts.

Current research work concentrates on the Ti-Si system and the application
of rapid pulse lamp annealing to form refractory silicide-Si contacts.


and the latter is due to Al spiking. Our test results show that open failure is
observed in both Al/n+ and Al/p+ contacts while leakage failure is observed only
in Al/n+ contacts. SEM examinations, however, show contact pit formation in
both cases. A good explanation is that an Al contact to an n-substrate forms a
Schottky diode while an Al contact to a p-substrate creates a short. Therefore,
no junction leakage increase is seen in Al/p+ contacts.

The total contact chain resistance is also monitored. In the Al/n+ case, this
resistance decreases slightly initially and then increases until open-circuited.
The initial decrease is due to barrier height reduction. In the Al/p+ case, the
total resistance remains constant initially and then starts increasing. In either
case, once the total resistance starts increasing, the contact chain opens up in a
very short period. SEM examinations show the open occurs in or near the
contact closest to the anode, depending on metal compositions.

As a low value of specific contact resistivity is desirable in VLSI technology,
there is a concern about the contact electromigration lifetime. The lower the
specific contact resistivity, the higher the current density near the contact
edge. Higher current density may cause contacts to fail in a shorter period. A
study to investigate this trade-off is currently underway.


4.10. Modification of Metal-Si Contacts With Ion Beams (N. Cheung)

We have completed the study of two metallization systems: the Al-Si and
Ta-Si contacts. For the Al-Si system, a contact resistivity, ρc, of 1.5 x 10^-6
Ω-cm^2 has been determined. After implantation of As (300 keV) up to a dose of
10^1[digit illegible]/cm^2 and a post-implantation annealing at 450°C, the
substrate leakage current is substantially reduced. The effect can be
attributed to the increased
interface  uniformity  of  the  implanted  contacts  when  compared  to  their 
unimplanted  counterparts.  The  smoother  interface  morphology  of  the 
implanted  interface  has  also  been  verified  by  scanning  electron  microscopy. 

The contact resistivity of Ta-Si contacts is 2.5 x 10^[exponent illegible]
Ω-cm^2 for the unimplanted samples after annealing at 600°C. The contact
resistance, Rc, was found to be proportional to (Area)^[exponent illegible].
This power dependence is consistent with the current crowding effect under the
leading edge of the contact, which reduces the effective current carrying area.
After As implantation (400 keV) to a dose of 5 x 10^15/cm^2, ρc increases by an
order of magnitude to 20 x 10^[exponent illegible] Ω-cm^2. The substrate
leakage current for both implanted and unimplanted contacts is much lower than
that found for Al-Si contacts. For the implanted Ta-Si contacts, there is no
indication of any contact pitting or uneven Ta/Si reaction. The increased
contact uniformity is attributed to ion-beam mixing effects (i.e., the
dispersion of interfacial contaminants and atomic mixing promote the uniform
reaction of the Ta/Si interface).

A reliability study of Ta as a thin-film diffusion barrier has also been carried
out. The samples were prepared by depositing 150 Å of Ta over the entire wafer.
The Ta film was then reacted with the substrate Si by As+ implantation and
post-annealing. The wafer was then subjected to a selective etch to remove the
unreacted Ta over SiO2 regions. A thick layer of Al was evaporated and
patterned using a lift-off process. The wafer was then annealed at 500°C for 60
minutes. Substrate leakage currents taken from 2 µm daisy chain patterns are


Albert Wu has made several specific contributions to this work. One is a
computer program for automatically generating the layout rules from process bias
information. He is also introducing new technology elements, including a PRIST
technique for protecting the poly during the p-channel source and drain
implant.


4.7. Simulation Aids for Viewing Topography from the Layout (A. Neureuther)


A major milestone was accomplished in the completion of SIMPL-1, which is
the CAD tool for Simulating Profiles from the Layout using rectangular shapes.
SIMPL-1 uses a file of process steps and the CIF layout information to
automatically generate an SEM-like view of the device topography. Any IC process
sequence can be simulated by first describing the process steps from a menu of
process  elements.  Examples  of  double  N-well  CMOS  and  Bipolar  devices  were 
given  at  the  IEDM  meeting.  The  documentation  was  completed  in  the  MS  report 
of  Mike  Grimm  in  December  1983  and  a  tape  is  being  prepared  for  distribution. 
Work  is  being  continued  by  Keunmyung  Lee  on  data  structures  for  a  polygonal 
version  and  interface  to  other  process  simulators  such  as  SAMPLE  and  SUPREM 
under  SRC  support. 


4.8. Al/Si Specific Contact Resistivity Measurement (W.G. Oldham)

We have demonstrated that, using a properly designed test pattern, specific
contact  resistivity  can  be  consistently  determined  from  contact  end  resistance 
using  a  transmission  line  model. 
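
As an illustration of how a specific contact resistivity might be extracted from a measured contact end resistance, the sketch below uses the textbook transmission line model expression R_e = sqrt(R_s * rho_c) / (W * sinh(d / L_T)), with transfer length L_T = sqrt(rho_c / R_s). The report does not state its equations or test pattern dimensions, so the choice of formula, the function names, the search bracket, and every numerical value here are assumptions made for the example.

```python
import math

def contact_end_resistance(rho_c, r_sheet, width, length):
    """End resistance (ohm) of a rectangular contact in the transmission line model.

    rho_c   : specific contact resistivity (ohm*cm^2)
    r_sheet : sheet resistance of the diffusion under the contact (ohm/square)
    width, length : contact dimensions (cm); current flows along `length`
    """
    transfer_length = math.sqrt(rho_c / r_sheet)
    x = length / transfer_length
    if x > 700.0:          # sinh would overflow; the end resistance is effectively zero
        return 0.0
    return math.sqrt(r_sheet * rho_c) / (width * math.sinh(x))

def extract_rho_c(r_end, r_sheet, width, length, lo=1e-10, hi=1e-2):
    """Invert the model for rho_c by bisection on a logarithmic scale.

    Relies on the end resistance increasing monotonically with rho_c.
    The bracket [lo, hi] is an assumed search range, not a value from the report.
    """
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if contact_end_resistance(mid, r_sheet, width, length) < r_end:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Hypothetical example: two contact sizes on the same junction.
RHO_TRUE = 1e-6      # ohm*cm^2, assumed
R_SHEET = 30.0       # ohm/square, assumed
for side_um in (3.0, 6.0):
    side = side_um * 1e-4                     # micrometres to cm
    r_end = contact_end_resistance(RHO_TRUE, R_SHEET, side, side)
    rho = extract_rho_c(r_end, R_SHEET, side, side)
    print(f"{side_um:.0f} um contact: R_end = {r_end:.3f} ohm, rho_c = {rho:.3e} ohm*cm^2")
```

In this idealized round trip, contacts of both sizes return the same resistivity, which is the kind of geometry independence the measurement aims to show; real data would also carry measurement error and parasitic effects that the sketch ignores.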

The results show that, for a given junction doping and drive-in cycle, the
specific contact resistivities are constant regardless of the sizes of the contacts.
Such measurement consistency and contact geometry independence have not
been shown before by either the conventional method or other published methods.

The dependence of specific contact resistivity on junction surface doping is
also shown to agree with that predicted by tunneling theory. Such agreement
is good even when junction surface doping extends beyond the commonly accepted
electrically-active solid solubility.


4.9. Al/Si Contact Electromigration (W.G. Oldham)

The failures of Al/n+ and Al/p+ contacts due to current stressing are being
studied. For this study, an automatic measurement setup based on an IBM Personal
Computer has been configured. In order to minimize temperature rise due to current
heating, tests are conducted on test chips placed on a wafer hot chuck with a
temperature controller.

There are two types of failures which can result from the stress, i.e., open
circuit and junction leakage failures. The former is due to metal line opening

substrate resistance including the effects of guard rings. A new technique for
suppressing latch-up was developed. A high energy (4MeV) implant of boron
produces a buried p+ layer as a substitute for the widely used p on p+ epitaxial
substrate. Since the high energy implant is performed after the long well
diffusion, the "epitaxial" layer thickness can be scaled more easily and is
expected to be more uniform. Latch-up improvement is shown to match that of
epitaxial substrates.


References: 

(1) K. Terrill, C. Hu, "Substrate Potential Calculation for Latch-up Modeling,"
accepted for publication in IEEE Electron Device Letters.

(2) K. Terrill, C. Hu, "Prevention of CMOS Latch-up by High Energy
Implantation," submitted to IEEE Electron Device Letters.


4.5.  Soft  Error  Studies  (C.  Hu,  A.  Neureuther) 

We published the computer analysis of the collection of alpha-generated
charge by collectors surrounded by either uniform reflecting or uniform
absorbing surfaces. These are the two extreme cases of any real conditions that exist
in ICs. The analysis of the upper limit of charge collection should be more useful
for  circuit  design  than  the  previously  available  lower  limit.  The  collected  charge 
scales  linearly  with  the  linear  dimension  of  the  collector. 


References: 

(1) K. Terrill, C. Hu, A. Neureuther, "Computer Analysis on the Collection of
Alpha-Generated Charge for Reflecting and Absorbing Surface Conditions
around the Collector," Solid State Electronics, No. 1, January 1984, pp. 45-52.


4.6. Berkeley Advanced CMOS (A. Neureuther)

A collective effort is being made to bring up our advanced CMOS process in
our new 4" facility at Berkeley. Albert Wu, under DARPA support, is making a
major contribution to this effort as the process coordinator. Several runs of the
double N-well process on 2" wafers have been completed by Kyle Terrill for
latch-up studies, and the device characteristics have been reasonable for
channel lengths under 2 um. This process is being modified to create a baseline
process which is suitable for both the linear and digital needs of research groups in
the Department, appropriate for our 4" equipment, and free of process
bottlenecks such as occur with implants. This effort has been greatly aided by
course work activities under the direction of Professors Oldham and Neureuther.
During the Fall Semester, 20 students from the class designed test structures for
device, process, and yield on the 2x10 pad base for automatic testing. This
semester, 10 students are making the first run of 4" wafers through the lab.

much too short a lifetime for operation at 7V or less.

Future  work  will  fully  clarify  the  mechanism  for  interface  degradation  and 
TDDB above and below 7V. (Alpha particles, we now hypothesize, may induce
oxide  breakdown.)  Methods  for  predicting  oxide  reliability  will  be  developed. 


References: 

(1) C. Hu (invited paper), "Charge Tunneling, Trapping, and Device
Degradation in Thin SiO2," 1983 IEEE Surfaces and Interfaces Specialists
Conference, Fort Lauderdale, Florida, December 1-3, 1983.

(2) M.S. Liang, C. Chang, W. Yang, C. Hu, R.W. Brodersen, "Hot-Carriers Induced
Degradation in Thin Gate Oxide MOSFETs," Technical Digest of 1983 IEEE
International Electron Devices Meeting, Washington, D.C., December 1983,
pp. 186-189.

(3) C. Chang, M.S. Liang, C. Hu, R.W. Brodersen, "Carrier-Tunneling Related
Phenomena in Thin Oxide MOSFET's," Technical Digest of 1983 IEEE
International Electron Devices Meeting, Washington, D.C., December 1983, pp.
194-197.

(4) S. Holland, I.C. Chen, T.P. Ma, C. Hu, "On Physical Models for Gate Oxide
Breakdown," submitted to IEEE Electron Device Letters.


4.3. Device Characterization and Modeling (C. Hu, P.K. Ko, R.S. Muller)

A well-equipped characterization laboratory has been set up. Besides the
hot-electron studies and thin oxide research described above, this laboratory is
being used to study the charge pumping technique for MOS interface
characterization. A simple quantitative model for the switch-induced error in
switched-capacitor circuits has been developed for the first time. A new MOSFET model
suitable for CAD has been developed, and will be tested in a special version of SPICE.
Research is underway to model and characterize lightly-doped-drain (LDD)
transistors.


References: 

(1) B.J. Sheu, C. Hu, "Modeling the Switch-Induced Error Voltage on a Switched
Capacitor," IEEE Transactions on Circuits and Systems, CAS-30, No. 12,
December 1983, pp. 911-913.

(2) B.J. Sheu, C. Hu, "Switch-Induced Error Voltage on a Switched Capacitor,"
accepted for publication in IEEE Journal of Solid State Circuits.

(3) B.J. Sheu, D.L. Scharfetter, C. Hu, D.O. Pederson, "A Compact IGFET
Charge Model," submitted to IEEE Circuits and Systems Letters.


4.4.  CMOS  Latch-up  Studies  (C.  Hu) 

While  qualitative  models  for  CMOS  latch-up  abound,  there  is  little  research 
to  quantify  the  problem.  We  have  developed  a  model  for  the  (distributed) 


4.  TECHNOLOGY 


4.1. Hot Electron Studies (C. Hu, P.K. Ko, R.S. Muller)

A comprehensive review of the results of our research under the
sponsorship of DARPA was presented at the 1983 IEDM in an invited paper titled
"Hot-Electron Effects in MOSFETs". A special effort was made to publicize our
models in terms that are easily understood by IC engineers. A quantitative
theory of photon generation by hot electrons, in excellent agreement with
experiment, was completed and has been accepted for publication. Photon
emission was also observed from forward-biased pn junctions. An in-depth study of
our lucky-electron model for channel hot electron emission was written up and
has been accepted for publication. We have observed a significant channel-width
dependence of hot-electron effects and are preparing a report on it. An MS and
a Ph.D. thesis were completed during this report period.

Future research will determine the mechanisms responsible for MOSFET
degradation and search for a reliable method to forecast device lifetime and
screen unreliable devices. We will also study new device structures that
minimize the hot electron effects.

References 

(1) C. Hu (invited paper), "Hot-Electron Effects in MOSFET's," Technical
Digest of 1983 IEEE International Electron Devices Meeting, Washington, D.C.,
December 1983, pp. 176-181.

(2) S. Tam, C. Hu, "Hot Electron Induced Photo-Carrier Generation in Silicon
MOSFETs," accepted for publication in IEEE Transactions on Electron
Devices.

(3) T.C. Ong, K.Y. Terrill, S. Tam, C. Hu, "Photo Generation in Forward-biased
Silicon PN Junctions," IEEE Electron Device Letters, EDL-4, No. 12,
December 1983, pp. 460-462.

(4) S. Tam, P.K. Ko, C. Hu, "Lucky-Electron Model of Channel Hot Electron
Injection in MOSFET's," accepted for publication in IEEE Transactions on
Electron Devices.


4.2.  Thin  Oxide  Studies  (C.  Hu,  R.W.  Brodersen) 

We were invited to review the results of our DARPA-sponsored research on
charge transport, trapping, and degradation in thin oxides at the 14th
Semiconductor Interface Specialists Conference. Two papers were presented at
the 1983 IEDM. One addressed the physical nature of hot-carrier induced
damage in MOS systems and gave experimental proof that hot holes play a negligible role
in the process. The other presented, among other results, the first direct
measurement of the quantum yield of impact ionization as a function of electron
energy. A full paper on the subject is being prepared. As predicted in the last
semiannual report, much progress was made on the study of the time-dependent
dielectric breakdown (TDDB) of oxides in this period. A survey was presented at the
1983 Wafer Reliability Assessment Workshop. One completed paper reports clear
evidence that holes generated by impact ionization are responsible for TDDB.
Other known models, such as one based on electron trapping, are shown to be
incorrect. We also found the standard method of accelerated testing to predict

ARCHITECTURE


The following section contains papers and reports relating to research in
computer architecture. They describe work which was wholly or in part performed
under the sponsorship of the DARPA grant.


(1) R.W. Sherburne, M.G.H. Katevenis, D.A. Patterson, and C.H. Séquin, "A 32b
NMOS Microprocessor with a Large Register File", ISSCC 1984, Digest of
Tech. Papers, San Francisco, Feb. 1984, pp. 168-169.


(2) M.G.H. Katevenis, "Reduced Instruction Set Computer Architecture for
VLSI", Ph.D. Thesis, U.C. Berkeley, Oct. 1983.


(3) R.W. Sherburne, "Processor Design Tradeoffs in VLSI", Ph.D. Thesis, U.C.
Berkeley, (~ April 1984).


(4) R.W. Sherburne, M.G.H. Katevenis, D.A. Patterson, and C.H. Séquin, "Local
Memory in RISCs", Proc. Intnl. Conf. on Computer Design, New York, Oct.
31-Nov. 3, 1983, pp. 149-152.


(5) Y. Tamir and C.H. Séquin, "Strategies for Managing the Register File in
RISC", IEEE Trans. on Computers, C-32, No. 11, Nov. 1983, pp. 977-989.


(6) W-H Yu, "LU Decomposition on a Multiprocessing System with
Communication Delay", Ph.D. Thesis, University of California, Berkeley, CA, March 1984.


(7) H-H Lu, "High Speed Recursive Filtering", Ph.D. Thesis, University of
California, Berkeley, CA, Dec. 1983.


(8) R.M. Fujimoto, "VLSI Communication Components for Multicomputer
Networks," Ph.D. Thesis, U.C. Berkeley, Fall 1983.


(9) R.M. Fujimoto, "SIMON, A Simulator of Multicomputer Networks," Tech.
Report No. UCB/CSD 83/140, Sept. 1983.


(10) Y. Tamir and C.H. Séquin, "Self-checking VLSI Building Blocks for
Fault-Tolerant Multicomputers," Proc. Intnl. Conf. on Computer Design, New
York, Oct. 31-Nov. 3, 1983, pp. 561-564.


(11) Y. Tamir and C.H. Séquin, "Design and Application of Self-Testing
Comparators Implemented with MOS PLAs," scheduled for publication in IEEE
Transactions on Computers, Spring 1984.

SESSION XII: MICROPROCESSORS AND MICROCONTROLLERS

THAM  12.1:  A  32b  NMOS  Microprocessor  with  a  Large  Register  File* 

Robert W. Sherburne, Jr., Manolis G.H. Katevenis, David A. Patterson, Carlo H. Séquin
University  of  California 
Berkeley,  CA 


Chairman:  Dana  Seccombe 

Hewlett-Packard  Co 
Ft.  Collins,  CO 


TWO SCALED VERSIONS of a 32b Reduced Instruction Set
Computer¹ CPU, RISC II, have been implemented in a
3-mask NMOS process using layout rules² with Lambda-values
of 2 µm and 1.5 µm (corresponding to drawn gate lengths of 4 µm
and 3 µm, respectively). This approach has resulted in two
surprisingly powerful processors.

The  RISC  II  architecture  uses  a  small,  but  carefully  selected 
set  of  simple  instructions  that  support  the  most  frequently 
occurring  operations.  It  relies  on  an  orthogonally  partitioned 
32b  instruction  format  and  regular  execution  timing,  affording 
a  reduction  of  the  amount  of  control  circuitry  required  on  chip. 
While  most  current  microprocessors  use  from  40%  to  70%  of 
the  chip  area  for  control,  in  RISC  II  the  control  section  occupies 
less  than  10%  of  the  chip.  The  regular  instruction  execution  also 
permits  the  realization  of  a  simple,  yet  efficient  pipeline.  In 
RISC II, the machine cycle time is limited only by the speed
of  the  datapath,  rather  than  by  a  long  control  path  through  a 
large  microstore. 

The  area  freed  up  by  the  simplification  of  the  control  logic 
has  been  used  to  add  a  significant  amount  of  local  memory  in 
the form of a large register file, organized as ten global registers
plus eight banks of 22 local registers each, with an overlap of
6 registers between adjacent banks. These register banks are used
to support procedures: the overlap registers are used for
parameter passing, and the others for local scalar variables. In this
manner, the most frequently used operands reside on chip, and the data
memory  traffic,  which  often  constitutes  a  serious  limitation  of 
overall  system  performance,  is  significantly  reduced. 
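
The register counts quoted above can be checked with a short calculation. The sketch below assumes the banks wrap around circularly, so that each of the eight banks contributes 22 - 6 = 16 registers of its own; the slot numbering in `physical_slot` is a hypothetical layout chosen for illustration, not the chip's actual addressing.

```python
N_GLOBALS = 10    # global registers, visible to every procedure
N_BANKS = 8       # register banks
BANK_SIZE = 22    # local registers seen in each bank
OVERLAP = 6       # registers shared between adjacent banks

def total_registers():
    """Physical register count, assuming the banks wrap around circularly."""
    unique_per_bank = BANK_SIZE - OVERLAP
    return N_GLOBALS + N_BANKS * unique_per_bank

def physical_slot(bank, local_index):
    """Map (bank, local register 0..21) to a physical slot.

    Hypothetical layout: globals occupy slots 0..9, and the last OVERLAP
    registers of one bank are the first OVERLAP registers of the next.
    """
    unique_per_bank = BANK_SIZE - OVERLAP
    ring = N_BANKS * unique_per_bank
    return N_GLOBALS + (bank * unique_per_bank + local_index) % ring

print(total_registers())                             # 138
print(physical_slot(0, 16) == physical_slot(1, 0))   # True: a shared overlap register
```

Under this assumption a procedure sees 10 + 22 = 32 registers at a time, and a call only has to move to the next bank for the caller's outgoing parameters to appear as the callee's incoming ones.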

The CPU chip photomicrograph is shown in Figure 1. Chip
area is dominated by the highly regular datapath. More than
half of its length is occupied by the register file with a total of
138 32b registers. The control circuitry lies in the top right
portion of the chip. It consists of a few small PLAs, rather
than  a  microcode  ROM.  The  actual  opcode  decoder  (0.3%  of 
chip  area)  is  a  generalized  decoder  consisting  of  an  AND  plane 
and  a  single  row  of  OR  gates,  which  makes  it  smaller  and 
faster  than  a  full  PLA. 

The  organization  of  the  three-stage  pipelined  datapath  is 
shown in Figure 2. It is based on a 2-bus, dual-port static
register  file.  The  first  stage  of  pipelining  includes  instruction 
fetch and decoding. In the second stage, two operands are read
from the register file and an ALU or shift operation is
performed. The result is written back to the register file during
the third stage; page-fault and other interrupts and traps are
handled simply by aborting this last stage of the pipeline, as
it is the only one that modifies user-visible state. The ALU
or shifter result may be forwarded directly to the next executing
instruction before it is written back into the register file. Thus,
data dependency delays are eliminated.

On  branch  instructions,  the  pipeline  is  not  flushed:  transfer 
of  control  is  simply  delayed  by  one  pipeline  stage.  This  delay 
is accommodated in software by defining all branch
instructions to take effect only after the subsequent instruction.
An optimization pass of the compiler can often put a useful
instruction into this slot after each delayed branch instruction.
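
The delayed-branch rule can be illustrated with a toy interpreter. This is a minimal sketch of the semantics described above (a jump takes effect only after the instruction that follows it); the instruction encoding and the `run` function are invented for the example and do not correspond to the RISC II instruction set.

```python
def run(program):
    """Execute a toy program with one-slot delayed jumps; return the executed labels.

    Each instruction is ("op", label) or ("jump", target_index).
    """
    executed = []
    pc = 0
    pending_target = None          # set by a jump, consumed after the delay slot
    while pc < len(program):
        kind, arg = program[pc]
        next_pc = pc + 1
        if pending_target is not None:
            # This instruction sits in the delay slot: it still executes,
            # and only then does control transfer to the jump target.
            next_pc = pending_target
            pending_target = None
        if kind == "jump":
            pending_target = arg
        else:
            executed.append(arg)
        pc = next_pc
    return executed

program = [
    ("op", "a"),
    ("jump", 4),
    ("op", "delay-slot work"),     # executed even though it follows the jump
    ("op", "skipped"),
    ("op", "target"),
]
print(run(program))   # ['a', 'delay-slot work', 'target']
```

A compiler that cannot find useful work for the slot would have to place a no-op there, which is why the optimization pass matters.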

For accessing data memory external to the chip, normal
instruction fetch and execution is suspended between the
second and third pipeline stages to make the single I/O bus
available for data traffic. The timing within each four-phase
cycle allows up to three phases of the cycle for accessing
external memory.

Dynamically  precharged  busses  are  used  wherever  possible. 
The ALU uses a dynamic carry chain with buffering every
four bits. A single 32b crossbar shifter with precharged,
bidirectional  busses  is  included  for  both  right  and  left  shifts. 

Register  file  operation  is  illustrated  with  Figure  3.  All 
bitlines are initially precharged during φ4. Simultaneously,
the word lines are clamped to ground, and address decoding
takes  place.  The  grounded  depletion  transistor  reduces  the 
delay  in  the  NOR  decoder  by  isolating  the  input  parasitics. 

The full 5V signal is delivered to the wordline driver by the
clocked  depletion  mode  pass  transistor.  The  bootstrapped 
wordline  driver  is  clocked  during^.  The  selected  bit  cell 
then may discharge one of its associated bitlines. A single-
ended sensing scheme was chosen due to its simple and
compact realization. Register write is performed in the usual
manner by differentially driving the bitline pair; the decoders
are shared for both read and write addresses.

The  two  designs  were  implemented  through  silicon 
foundries*. Both chips were fully functional on first silicon.
Design  verification  was  made  at  the  layout  level  (design  rule, 
electrical  rule,  and  label  checking),  the  circuit  level  (switch 
level  simulation,  delay  path  analysis,  and  SPICE  simulation 
of  critical  paths),  and  at  the  register  transfer  level  (SLANG). 

The larger chip, for which all the simulation and delay
analysis was carried out, ran within 3% of expected performance.
Its instruction cycle time was 500ns rather than 480ns,
corresponding to an 8MHz clock. The scaled down chip, for
which no additional simulation had been done, performed
with a machine cycle of 330ns (12MHz clock rate). Chip
specifications  are  summarized  in  Table  1. 
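
The relation between clock rate and machine cycle time can be sketched as follows, assuming that one machine cycle consists of the four clock phases mentioned earlier and that each phase lasts one clock period. That assumption is not stated in the text; it is adopted here because it reproduces the quoted machine cycle of about 330ns at 12MHz.

```python
PHASES_PER_CYCLE = 4   # "four-phase cycle"; one clock period per phase is an assumption

def machine_cycle_ns(clock_mhz, phases=PHASES_PER_CYCLE):
    """Machine cycle time in ns for a given clock rate in MHz."""
    period_ns = 1000.0 / clock_mhz
    return phases * period_ns

for clock in (8.0, 12.0):
    print(f"{clock:.0f} MHz -> {machine_cycle_ns(clock):.0f} ns per machine cycle")
```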

Overall, the RISC approach has resulted in an architecture
that supports the most frequently occurring operations in an
effective  manner.  The  high  computational  throughput  is  due 
to  the  good  match  of  the  RISC  architecture  with  the  needs  and 
constraints  of  VLSI  single-chip  processors. 

SPECIFICATIONS          CHIP A        CHIP B

DRAWN GATE LENGTH       4 µm          3 µm
DIE SIZE (mil²)         228 x 406     171 x 304
REGISTER CELL AREA      4.6 mil²      4.6 mil²
CLOCK RATE              8 MHz         12 MHz
POWER DISSIPATION       1.25 W        1.83 W
REGISTER FILE           138 x 32 bits
TRANSISTOR COUNT        40706
PIN COUNT               62
DESIGN TIME             2.8 man-years

TABLE I: Chip specifications.


FIGURE  3— Dual-port  register  file  read  circuitry. 

REDUCED INSTRUCTION SET COMPUTER ARCHITECTURES FOR VLSI


Emmanuil-Manolis Georg a Katevenis

ABSTRACT 

Integrated  circuits  offer  compact  and  low-cost  implementation  of  digital 
systems,  and  provide  performance  gains  through  their  high-bandwidth  on-chip 
communication.  When  this  technology  is  used  to  build  a  general-purpose  von 
Neumann processor, it is desirable to integrate as much functionality as
possible on a single chip, so as to minimize off-chip communication. Even in Very
Large Scale Integrated (VLSI) circuits, however, the transistors available on the
limited chip area constitute a scarce resource when used for the
implementation of a complete processor or even computer, and thus, they have to be used
effectively. This dissertation shows that the recent trend in computer
architecture towards instruction sets of increasing complexity leads to inefficient use of
those scarce resources. We investigate the alternative of Reduced Instruction
Set Computer (RISC) architectures which allow effective use of on-chip
transistors in functional units that provide fast access to frequently used operands and
instructions. 

In  this  dissertation,  the  nature  of  general-purpose  computations  is  studied, 
showing the simplicity of the operations usually performed and the high
frequency of operand accesses, many of which are made to the few local scalar
variables of procedures. The architecture of the RISC I and II processors is
presented. They feature simple instructions and a large multi-window register
file,  whose  overlapping  windows  are  used  for  holding  the  arguments  and  local 
scalar variables of the most recently activated procedures. In the framework of
the RISC project, which has been a large team effort at U. C. Berkeley for more
than three years, a RISC II nMOS single-chip processor was implemented, in
collaboration with R. Sherburne. Its microarchitecture is described and evaluated,
followed by a discussion of the debugging and testing methods used. Future
VLSI  technology  will  allow  the  integration  of  larger  systems  on  a  single  chip.  The 
effective  utilization  of  the  additional  transistors  is  considered,  and  it  is  proposed 
that  they  should  be  used  in  implementing  specially  organized  instruction  fetch- 
and-sequence  units  and  data  caches. 

The architectural study and evaluation of RISC II, as well as its design,
layout, and testing after fabrication, have shown the viability and the advantages of
the RISC approach. The RISC II single-chip processor looks different from other
popular commercial processors: it has fewer total transistors, it spends only 10%
of the chip area for control rather than one half to two thirds, and it required
about five times less design and layout effort to get chips that work correctly
and  at  speed  on  first  silicon.  And,  on  top  of  all  that,  RISC  II  executes  integer, 
high  level  language  programs  significantly  faster  than  these  other  processors 
made  in  similar  technologies. 


C. H. Séquin (Committee Chairman)

CHAPTER 1:


INTRODUCTION. 


Even in Very Large Scale Integrated (VLSI) circuits, the number of
transistors available on a single chip must be considered a limited resource. In the
course of the "Reduced Instruction Set Computer" (RISC) project at U. C.
Berkeley, it was found that hardware support for complex instructions is not the most
effective  way  of  utilizing  the  transistors  in  a  VLSI  processor.  In  chapter  1,  the 
RISC  concept  is  presented  first,  followed  by  an  overview  of  the  Berkeley  RISC 
project,  and  some  notes  on  the  organization  of  this  thesis. 


1.1  The  RISC  Concept: 

Effective  Use  of  Scarce  Hardware  Resources. 

Increasing  the  size  or  complexity  of  a  digital  circuit  may  either  enhance  or 
deteriorate  the  overall  system  performance,  depending  on  how  judiciously  the 
added  complexity  is  chosen.  The  Berkeley  RISC  project  has  demonstrated  the 
viability  of  general  purpose  computers  with  simple  instruction  sets.  It  has 
brought concrete evidence showing the non-optimal utilization of silicon
resources in most contemporary single-chip processors, due to the increased
complexity of their instruction sets.


1.1.1  Size,  Complexity,  and  Speed. 

Increasing the size or complexity of a digital circuit may lead to better
system performance. For example:


•  A  32-bit  adder  will  allow  a  32-bit  processor  to  operate  at  higher  speed 
than  a  16-bit  adder,  used  twice  for  each  addition,  would  allow  it  to 
operate. 

•  Overlapping  instruction  execution  with  the  fetching  of  the  next 
instruction  reduces  the  execution  time  of  programs  with  an  average 
number  of  jump  instructions. 

•  Including, for example, 4 registers into a general-purpose CPU will
give a much better performance than if only 2 registers were
included, if the compiler can take advantage of the additional
registers.


All  these  are  examples  of  cases  where  the  increase  in  size  or  complexity  is  used 
to  allow  parallel  execution  of  common  parallel  operations,  or  to  provide  faster 
access  to  frequently  used  operands. 

On  the  other  hand,  increasing  the  size  or  complexity  of  a  digital  circuit  can 
also  have  negative  effects  on  its  performance: 


•  A  larger  size  entails  longer  wire  delays. 

•  More  gates  mean  less  power  is  available  per  gate,  resulting  in 
reduced  driving  strength. 

•  A more complex mode of operation usually means interposing more
circuit elements in the path of information flow: for example,
additional or larger input multiplexors, increased output fanout, or more
circuits  hanging  off  busses.  This  inevitably  reduces  the  maximum 
possible  operating  speed. 


In VLSI systems this trade-off between speed and size/complexity of a
circuit is more pronounced than it is in the previous-generation systems built from
TTL SSI/MSI parts. The following tables show typical capacitances and delay
times in the TTL technology and in the NMOS process by which RISC II was
fabricated:


Typical Delays of TTL inverters (7404) or 3-state buffers (74240):

              S-series    LS-series
CL = 15pF     3 ns        10 ns
CL = 50pF     5 ns        15 ns

Typical Delays of NMOS inverters or buffers:

              high-power  low-power
CL = 0.1pF    3 ns        10 ns
CL = 2.5pF    15 ns       60 ns

In  VLSI  MOS  technologies,  the  gate  delays  vary  over  much  wider  ranges  than  in 
discrete technologies. The reason is twofold. On the one hand, the load
capacitance significantly influences the delay time of MOS gates, while for discrete
parts a large portion of the delay is due to the internal circuitry and to the
package and does not depend so strongly on CL. On the other hand, custom MOS
offers  much  wider  design  choices  in  terms  of  size  of  devices,  type  of  circuits 
available,  and  size  of  load  to  be  driven.  Thus,  the  dependence  of  system  speed 
on  size  and  complexity  is  much  more  direct  in  VLSI  than  it  is  in  older,  discrete 
technologies. 
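
The difference in load sensitivity can be made concrete with a rough straight-line model, delay = intrinsic + slope x load capacitance, fitted through the two points given for each column of the tables above. This is only a sketch: the linear model and the function name are assumptions introduced here, the table values are taken as read (15pF/50pF for TTL, 0.1pF/2.5pF for NMOS), and two points per column cannot confirm that the real behavior is linear.

```python
def fit_linear_delay(point_a, point_b):
    """Fit delay = intrinsic + slope * load through two (load_pF, delay_ns) points."""
    (c_a, t_a), (c_b, t_b) = point_a, point_b
    slope = (t_b - t_a) / (c_b - c_a)      # ns per pF of load
    intrinsic = t_a - slope * c_a          # extrapolated delay at zero load
    return intrinsic, slope

# (load in pF, delay in ns), as read from the two tables
TABLE = {
    "TTL S-series":    ((15.0, 3.0), (50.0, 5.0)),
    "TTL LS-series":   ((15.0, 10.0), (50.0, 15.0)),
    "NMOS high-power": ((0.1, 3.0), (2.5, 15.0)),
    "NMOS low-power":  ((0.1, 10.0), (2.5, 60.0)),
}

for name, (p1, p2) in TABLE.items():
    intrinsic, slope = fit_linear_delay(p1, p2)
    print(f"{name:16s} intrinsic = {intrinsic:5.2f} ns, slope = {slope:6.3f} ns/pF")
```

With these numbers the fitted slope for the NMOS gates comes out roughly two orders of magnitude steeper per picofarad than for the TTL parts, which is the quantitative form of the statement that MOS gate delay is dominated by the load being driven.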

Another  important  factor  in  VLSI  system  design  is  the  large  difference  in 
available  bandwidth  between  on-chip  and  off-chip  communication.  In  today’s 
(1983)  technology,  one  may  typically  see  transfers  on  the  order  of  200  bits  every 
20  ns  on-chip,  versus  only  50  bits  going  through  the  chip  periphery  every  50  ns. 
This  character  of  the  chip  periphery  as  communications  bottleneck  makes  it 
desirable  to  pack  as  much  functionality  as  possible  into  the  restricted  area  of  a 
single  chip.  In  this  context,  an  increase  of  the  size  and  complexity  of  one  circuit 
feature  may  only  be  achieved  at  the  expense  of  another. 
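
The bandwidth gap follows directly from the figures quoted above. The short calculation below simply converts "bits per interval" into Gbit/s for the two cases; the function name is illustrative.

```python
def bandwidth_gbit_per_s(bits, interval_ns):
    """Bits transferred per nanosecond, which is numerically equal to Gbit/s."""
    return bits / interval_ns

on_chip = bandwidth_gbit_per_s(200, 20)    # 200 bits every 20 ns
off_chip = bandwidth_gbit_per_s(50, 50)    # 50 bits every 50 ns

print(f"on-chip : {on_chip:.1f} Gbit/s")
print(f"off-chip: {off_chip:.1f} Gbit/s")
print(f"ratio   : {on_chip / off_chip:.0f}x")
```

The typical figures given in the text therefore imply roughly a tenfold advantage for on-chip communication.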


Taking these trade-offs between size/complexity and speed properly into
account leads to hierarchically organized systems, where the inner units are
physically  smaller  and  support  the  most  frequent  operations.  The  system's 
architect  has  the  important  role  of  selecting  the  functions  to  be  supported  at 
the  various  levels  of  the  system's  hierarchy.  This  is  particularly  important  in 
VLSI  system  design,  where  the  spectrum  of  possible  choices  is  wider  and  more 
continuous  than  it  is  in  systems  employing  discrete  technology. 


1.1.2  Recent  Trends,  and  the  RISC  Alternative. 

A general trend in computers today is to increase the complexity of
architectures commensurate with the increasing potential of implementation
technologies, as exemplified by the complex successors of simpler machines.
Compare, for example, the DEC VAX-11 to the PDP-11 [Stre78], the IBM System/38 to
the System/3 [Utle78], and the Intel iAPX-432 to the 8086 [Tyne81] [Orga82].

Following  the  discussion  made  in  the  previous  subsection,  it  is  necessary  to 
study  the  overall  effect  of  such  complex  instruction  sets  on  performance.  For  a 
VLSI system, does this approach lead to an effective utilization of the scarce
silicon resources? In 1980, the "Reduced Instruction Set Computer" (RISC) project
was  started  at  U.  C.  Berkeley,  with  the  goal  of  investigating  an  alternative  to  this 
trend.  The  hypothesis  was  that,  since  complex  instructions  are  rarely  used  by 
actual  programs,  their  inclusion  into  the  processor’s  instruction  set  has  more 
negative  effects  on  overall  performance  than  it  has  positive  ones.  On  the  other 
hand,  the  frequent  program  accesses  to  operands  justify  better  support  than  is 
normally  available  in  traditional  architectures.  A  third  consideration  was  that  a 
simplified architecture is important in a field with such rapidly changing
technology, because it leads to a short design and debugging time, thus allowing
quick  exploitation  of  the  new  technologies. 

From these considerations the Berkeley RISC Architecture was derived. It
specified  a  general  purpose  processor  with  simple  instructions  and  with  many 
registers  organized  in  multiple  register  banks.  The  RISC  project  has  now  (1983) 
demonstrated  not  only  the  viability  but  also  the  very  definite  advantages  of  this 
approach.  The  judicious  choice  of  the  instruction  set  was  a  key  to  this  success. 
First,  the  most  necessary  and  frequent  operations  (instructions)  in  programs 
were identified. Then, the data-path and timing required for their execution were
determined. And last, other frequent operations (instructions), which could also
fit  into  that  data-path  and  timing,  were  included  into  the  instruction  set. 

During  the  definition  of  the  RISC  architecture,  its  implementation  was  kept 
in mind at all times. The resulting architecture lies on a "knee of the curve" of
the  speed-versus-complexity  trade-off.  A  significant  number  of  commonly-used 
instructions  is  included  in  the  ISP  description  of  RISC  (§  3.1);  all  of  them  are 
implementable  with  a  simple  data-path  and  timing  scheme.  Including  more 
instructions  into  the  ISP  would  have  required  significant  changes  to  the 
hardware,  thus  slowing  down  the  cycle  time. 

U.  C.  Berkeley  is  not  the  only  place  where  research  on  simple  instruction 
sets is going on. Similar investigations are being carried out at IBM Watson
Research  Center  in  the  801  project  [Radi82],  and  at  Stanford  University  in  the 
MIPS  project  [Henn82]  [Henn83]. 

Evolution of the Berkeley RISC Project.


Table  1.2.1  shows  the  key  steps  in  the  history  of  the  Berkeley  RISC  project. 
Two  faculty  members  and  about  two  dozen  graduate  students  have  been  involved 
in this three-year project. The author of this dissertation has been heavily
involved in it, starting with the architectural studies in the spring of 1980;
subsequently his main concern was focused on the definitions of the
micro-architectures for both NMOS versions and on design, layout, and debugging
of RISC II.


Table 1.2.1: History of the RISC Project.

Period            Activity                    People
----------------  --------------------------  ------------------------------
wint.80           RISC idea                   Patterson, Séquin
spr.80            Architectural Studies       Patterson, 15 grad. stud.
sumr.80           Architecture Definition     Patterson, 4 grad. stud.
sumr.-fall.80     Compiler, Assem., Simul.    Campbell, Tamir
sumr.-fall.80     RISC I Micro-Architecture   Katevenis
wint.81           RISC II Micro-Architecture  Katevenis
wint.-spr.81      RISC I Design & Layout      Fitzpatrick, Foderaro,
                                              Peek, Peshkess, VanDyke
sumr.81-spr.82    RISC I fabrication          MOSIS, XEROX
sumr.82           RISC I tested               Foderaro, VanDyke
spr.-sumr.82      RISC I board                VanDyke
wint.81-wint.83   RISC II Design & Layout     Katevenis, Sherburne
spr.83            RISC II fabrication         MOSIS, XEROX
sumr.83           RISC II tested              Katevenis, Sherburne
1981-82           RISC/E ECL Paper Design     Beck, Davis, et al.
spr.-fall.82      I-cache Design & Layout     Hill, Lioupis, Nyberg, Sippel
spr.83            I-cache fabrication         MOSIS, XEROX
sumr.83           I-cache tested              Lioupis, Hill
fall.82-fall.83   CMOS RISC Layout Study      Takada
wint.83-ongoing   RISC II microcomputer       Lioupis, Campbell


The Berkeley RISC architecture was defined in 1980, after extensive
architectural studies performed during a graduate course. These included the
measurement of several program parameters, such as the number of various
statements and addressing modes, usage of local scalars, and procedure nesting
depth. The measurements were made mostly on programs written in C, and also in
Pascal, and did not include any numeric-computation programs. This is the
applications area for which the RISC architecture was designed. The author of
this dissertation contributed to the above studies with a preliminary look at
the data-path and the timing for such an architecture. Once the architectural
design was finalized, he defined a micro-architecture to implement it. This was
described in detail in [Kate80], and was subsequently adopted by a group of 5
graduate students who designed, laid out, and debugged the corresponding NMOS IC
in only six months. It was originally called "RISC I Gold", and later on simply
"RISC I". Its very short design time was due to the simplicity of the
architecture [Fitz81]. It was fabricated and tested; the chips were functionally
correct, but slower than intended by about a factor of 4 [FoVP82]. This was due
to a lack of tools, at that time, that could find all the critical timing paths
in a simulation of the whole chip. A RISC I board, with memory and I/O around
the CPU chip, was built by VanDyke and used to demonstrate the execution of
small programs.


In parallel with the design of RISC I, the present author defined a second,
more ambitious micro-architecture for the same processor architecture, and
subsequently implemented it, together with Robert Sherburne, by designing,
laying out, and debugging a second NMOS IC. This was originally called "RISC I
Blue", and later on "RISC II". It was fabricated and tested in 1983. The chips
are functionally correct and work very close to predicted speed. RISC II
occupies 25 % less silicon area than RISC I, even though it has 75 % more
registers. This was made possible by reducing the number of busses that go
through the register file from three to two, which led to a much more compact
register cell. To avoid a resulting performance loss, an additional pipeline
stage was used, as suggested by Lloyd Dickman. Overall, the circuit design and
layout for RISC II was done with careful attention to performance.

Other parts of the RISC project have been going on in parallel. A detailed
paper design was made for RISC/E, a RISC CPU and cache memory made out of
SSI/MSI ECL ICs [Blom83]. Another group designed and laid out an
Instruction-Cache chip for the RISC II CPU [Patt83]. Cache chips were fabricated
and tested; they were found to be functionally correct and to work very close to
the predicted speed. A CMOS version of the RISC II micro-architecture was studied
by M. Takada by designing and laying out the data-path and most of the control.
Finally, there is ongoing work, originally by Lioupis [Liou83] and now by
Campbell, on designing and building a micro-computer around the RISC II CPU and
I-cache chips.

Thesis Organization.


Since it is crucial for an architect to know which are the most frequently
used operations (instructions), chapter 2 reviews the relevant literature on
program measurements and complements it with a study of program properties done
with a different method, providing yet another point of view.

The next three chapters deal with the architecture, micro-architecture,
design, layout, debugging, and testing of RISC II. They show how the Berkeley
RISC architecture fits into the concept of effective utilization of the hardware
resources, and they present the most important experiences gained and
conclusions reached from the whole cycle of micro-architecture definition to
design, layout, debugging, and testing. A clarification of the terminology is
appropriate here. The term "RISC architecture" is general and refers to
any architecture inspired by the "RISC concept" as presented in section 1.1.
The term "the Berkeley RISC architecture" refers to the specific RISC
architecture defined at U.C. Berkeley in 1980 and implemented by RISC I and
RISC II; sometimes we may loosely use the shorter term "the RISC architecture".

The  thesis  concludes  with  a  projection  into  the  future.  Soon,  VLSI  chips  will 
have  significantly  more  transistors  than  were  used  by  RISC  I  or  RISC  II.  What  will 
these additional transistors be used for? Chapter 6 proposes additional hardware
organizations,  always  within  the  framework  of  simple  instruction  sets,  which  will 
make  effective  use  of  those  transistors  for  speeding  up  the  execution  of 
general-purpose  computations. 

COMPUTATIONS.

In the design of a computer system, two issues must be studied carefully:

(1)  FUNCTION:  What  is  the  purpose  of  the  computer  system?  What  is  the  nature 
of  the  computations  it  will  perform?  What  are  the  necessary  features  that  will 
enable  it  to  perform  those  computations  with  high  efficiency? 

(2) COST: Can the desirable architectural features be implemented at a
reasonable cost and with a reasonable performance, in a particular technology?
What are the trade-offs imposed by the constraints of a given implementation
technology?

This chapter focuses on the first question of what it is that computer
systems usually do, leaving the bulk of the discussion on implementation issues
for the next chapters. We are interested in "general-purpose computer systems".
Although it is difficult to define this term, we use it to refer to systems not
biased towards the execution of a particular algorithm, and, specifically,
systems that execute a mix of word processing, data base applications, mail and
communications, compilations, CAD, control, and numerical applications. The
chapter will assemble a picture of the nature of such "general-purpose"
computations, by collecting program measurements from the literature, and by
studying the critical loops of some representative programs. The resulting
picture will be used in the next chapters.


2.1  Goal and Methods of Program Measurements.


The main vehicle for a qualitative and quantitative understanding of the
nature of computations is the measurement of the important properties of
some real programs. It is very difficult for such a study to be made abstractly,
that is, without reference to a particular model of computers and computations,
because real programs and programming languages are written and defined
with a particular model in mind, and because the properties to be measured
depend on this model.

Throughout this dissertation, a von Neumann model of computers and
computations is assumed. Programs written in corresponding languages are
considered in this chapter. C and FORTRAN program fragments are studied, and
measurements  from  the  literature  are  reported,  which  were  collected  by  looking 
at  programs  written  in  FORTRAN,  XPL,  PL/I,  Algol,  Pascal,  C,  BLISS,  Basic,  and 
SAL.  This  section  identifies  the  main  properties  of  computations  which  are 
important  in  the  design  of  von  Neumann  architectures,  and  lists  tools  and 
methods  for  their  measurement. 

2.1.1  Architecturally Important Properties of Computations.

In  the  von  Neumann  model,  computations  are  performed  by  sequentially 
executing  operations  on  operands  which  are  kept  in  a  storage  device.  The 
sequence  of  operations  is  dynamically  controlled  by  operand  values.  Thus,  the 
properties  of  computations  that  will  interest  us  are: 


•  Operands used. Their type, size, structure, and the nature of their usage
   determine the storage organization for keeping them and the addressing
   modes for accessing them. In particular:

     •  Constant or variable operands.

     •  Types of operands: integers, floating-point, characters, pointers.

     •  Structure of operands: scalars, arrays, strings, structures of records.

     •  Declaration of operands: globals, procedure arguments, procedure locals.

     •  Number of operands, sizes, and frequency of accesses for the above
        categories.

     •  Amount and nature of locality-of-reference, possibly determined
        individually for each one of the above categories, for example, for
        scalars, arrays, (dynamic) structures, globals, and procedure
        activation records.

•  Operations performed. These will determine the required operational units,
   and their connection to the storage units. The relative frequency of
   operations such as the ones listed below is important, and the variation of
   those frequencies with the operands' categories is also of interest.

     •  Test, compare, add, subtract, multiply, divide, and so on.

     •  Operation type, such as integer, floating-point, or string.

     •  Higher level operations, such as I/O, buffer, list, and so forth.

•  Execution sequencing. This will determine the control and pipeline
   organization:

     •  Control transfers: conditional/unconditional jumps, calls, returns.
        What are their frequency, distance, conditions, predictability, and
        earliness of condition resolution?

     •  Amount and nature of extractable parallelism. This is a very general and
        important question; for von Neumann architectures, we are interested in
        low-level parallelism.


While quantitative measurements are essential, the large number of
properties to be measured (especially if correlation among them is also
studied) makes a qualitative understanding of the global picture equally
important. Methods for both kinds of analysis are presented below.

2.1.2  Static  and  Dynamic  Measurements. 

Program measurements are usually collected by running the program
under study through a suitable filter, or by executing it in a suitable
environment. In both cases the result of this processing is a count of the
number of times that some feature has appeared or that some particular property
has held true in the text of the program or in its execution.

Measurements referring to the text of a program are called static. They
give no useful information on performance, because they are not weighted
relative to the number of times each statement was executed. They can show the
size of storage required for the machine code and for the statically allocated
objects, and they can show what the compiler has to deal with. Under crude
assumptions, the static characteristics of programs can also give some
indication of their dynamic behaviour.

Measurements referring to the execution of a program are called dynamic.
Execution of the program requires previous compilation into object code for
some machine, unless expensive interpretation is used. Thus, dynamic
measurements usually refer to machine rather than source code, introducing
another (often unwanted) parameter into the study. Machine code can be
correlated back to source code, so that dynamic measurements at the source level
can be inferred. However, this correlation is not always easy or precise.
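
The difference between the two kinds of counts can be illustrated with a small sketch. The "program" below is a hypothetical list of statements, each with an assumed execution count; neither the statement mix nor the counts come from the measurements reviewed in this chapter.

```python
from collections import Counter

# Hypothetical program text: (statement kind, times executed in one run).
PROGRAM = [
    ("assign", 1),
    ("loop", 1),
    ("assign", 100),   # inside the loop body
    ("if", 100),       # inside the loop body
    ("call", 40),      # taken on some iterations only
    ("assign", 1),
]


def static_counts(program):
    """Count each statement kind once per occurrence in the program text."""
    return Counter(kind for kind, _ in program)


def dynamic_counts(program):
    """Count each statement kind once per execution."""
    counts = Counter()
    for kind, executions in program:
        counts[kind] += executions
    return counts


def percentages(counts):
    """Convert raw counts into percentages of the total."""
    total = sum(counts.values())
    return {kind: round(100.0 * n / total, 1) for kind, n in counts.items()}


static = percentages(static_counts(PROGRAM))
dynamic = percentages(dynamic_counts(PROGRAM))
print("static :", static)
print("dynamic:", dynamic)
```

In this made-up case the loop statement is one sixth of the text but a negligible share of the executed statements, while the statements inside the loop body dominate the dynamic count.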

Static and dynamic program measurements have frequently appeared in
the literature, and have also been collected early in the RISC project (spring
1980). Section 2.2 reviews some of them.




2.1.3  Source-Code  Profiling  and  Studying. 


Because  the  list  of  important  properties  of  computations  is  very  long,  and 
because  several  of  them  are  difficult  to  quantify  or  to  measure,  the  static  and 
dynamic  program  measurements  have  some  limitations.  There  is  another 
method  of  looking  at  the  nature  of  computations  which  is  less  quantitative  but 
more  qualitative,  and  which  can  complement  these  measurements  or  give  a 
better  idea  of  what  specific  other  measurements  should  be  taken.  That  method 
is to carefully study the source code of a program and, if possible, the
underlying algorithms, concentrating on those portions of it which account for most of
the  execution  time. 

It  has  been  observed,  time  and  again,  that  programs  spend  most  of  their 
execution  time  in  small  portions  of  their  code,  the  so-called  "critical  loops". 
This makes it feasible and worthwhile to study those portions in detail, to
understand the nature and properties of the computation that is carried out. The
critical  loops  can  be  identified  by  profiling  the  program  during  execution. 
Profiling  is  the  dynamic  measurement  of  how  much  of  the  execution  "cost"  is 
spent  at  each  place  in  the  program’s  code.  The  "cost"  may  be: 

•  time  spent, 

•  number  of  source-code  lines  executed, 

•  number  of  memory  accesses,  and  so  forth. 

In section 2.3 we will study some critical loops that have been identified by
other researchers. Section 2.4 studies some more critical loops, which were
identified by this author using two profiling systems. The first was the standard
profiling facility of UNIX: compilation using the -p or -pg switch, execution, and
then interpretation of the results by the prof or gprof program. This method
arranges that the program-counter of an executing process be sampled at
"random" intervals (on clock interrupts, every 1/60th of a second). The sampled
value is used to determine which procedure was executing at that time. If a
program runs for a long time, the above samples can be used to construct
estimates of how much time was spent in each of the program's procedures. There
is no straightforward way to find out the time spent in executing any smaller
program portions.
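
The sampling idea can be sketched as a small simulation. This is not how prof or gprof are implemented; it only mimics the principle on a hypothetical execution timeline, and the procedure names, durations, and tick length are all assumptions made for the illustration.

```python
from collections import Counter


def sample_profile(timeline, tick):
    """Estimate time per procedure by sampling at regular clock ticks.

    timeline: list of (procedure_name, duration) in execution order.
    tick:     sampling interval, in the same (integer) unit as the durations.
    Returns a dict mapping procedure name to estimated time.
    """
    # Record the time at which every timeline segment ends.
    ends, t = [], 0
    for name, duration in timeline:
        t += duration
        ends.append((t, name))
    total = t

    samples = Counter()
    segment = 0
    k = 1
    while k * tick <= total:
        now = k * tick
        # Advance to the segment that is executing at time "now".
        while ends[segment][0] < now:
            segment += 1
        samples[ends[segment][1]] += 1
        k += 1
    # Each sample is charged one full tick to the procedure it hit.
    return {name: n * tick for name, n in samples.items()}


# Hypothetical run, durations in milliseconds; true totals are
# parse = 75, lookup = 5, emit = 20.
timeline = [("parse", 30), ("lookup", 5), ("parse", 45), ("emit", 20)]
estimate = sample_profile(timeline, tick=10)
print(estimate)
```

In this example the short `lookup` procedure falls between two samples and receives no time at all, which is why such estimates are only meaningful for long runs and for units at least as large as a procedure.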

The second profiling system that was used, for programs written in C,
belongs to Bell Laboratories (Murray Hill), and was used under special
authorization [Wein]. It counts the number of times that each source-code line is
executed (but gives no indication as to how long its execution takes). A special
version of the C compiler is used, which inserts code at appropriate locations to
increment appropriate counters. At the end of execution the counts are saved
in a file. Another program is then invoked to correlate those counts with the
original source code, and to generate an annotated program listing †.
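
A line-count profile of this kind can be approximated in a few lines. The sketch below is not the Bell Laboratories tool: instead of a modified compiler that inserts counters, it uses the Python interpreter's `sys.settrace` hook, and the helper name `count_lines` and the sample function are made up for the illustration.

```python
import sys
from collections import Counter


def count_lines(func, *args):
    """Run func(*args) and count how often each of its source lines executes.

    Line numbers are reported relative to the function's "def" line.
    """
    code = func.__code__
    counts = Counter()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None            # do not trace any other function
        if event == "line":
            counts[frame.f_lineno - code.co_firstlineno] += 1
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)     # always restore the previous hook
    return result, counts


def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


result, counts = count_lines(sum_of_squares, 4)
for line in sorted(counts):
    print(f"line +{line}: executed {counts[line]} time(s)")
```

As with the original tool, the counts say how often a line ran, not how long it took. The count reported for the loop header line depends on how the interpreter accounts for loop re-entry, so it is best read with some care.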


2.2  Review of some Program Measurements from the Literature.

In this section interesting program measurements from the literature are
reviewed. Not all of the properties mentioned in section 2.1.1 are covered
here, because some of them either have not received enough attention
in the literature, or were difficult to measure. The measurements were selected
from:

† The count is not always what one would expect for lines like: "} else {". The
listings in section 2.4 have been corrected by hand in those situations.

[AlWo75]  Alexander and Wortman collected static and dynamic measurements
          from 19 programs (mostly compilers), written in XPL and executed
          on the IBM/360 architecture.

[Elsh76]  Elshoff presented static measurements of 120 commercial, production
          PL/I programs for business data processing.

[HaKe80] [TaSe83]
          Halbert and Kessler, in their study of multiple overlapping windows
          early during the RISC project, collected dynamic measurements on
          the number of arguments and local scalars per procedure, and on
          the locality property of procedure-nesting-depth. They measured
          the C compiler, the Pascal interpreter, the troff typesetter, and 6
          other smaller non-numeric programs (all written in C). Tamir and
          Séquin collected some more dynamic data on the locality of nesting
          depth, measuring the RISC C compiler, the towers-of-Hanoi program,
          and the Puzzle program (all written in C).

[Lund77]  Lunde used the concept of "register-lives" in his measurements. He
          analyzed half a dozen numeric-computation programs written in 5
          different HLL's (2 FORTRAN versions, Basic, Algol, BLISS), plus some
          compilers, all running on a DECsystem10 architecture.

[Shus78]  Shustek studied the usage made of the PDP-11 addressing modes, by
          statically measuring 10,000 lines of code of an operating system.

[PaSe82]  Patterson and Séquin presented the most important measurements
          collected during the early stages of the RISC project, in spring 1980,
          in collaboration with E. Cohen and N. Soiffer. Measurements are
          dynamic, and were collected from compilers, typesetters, and
          programs for CAD, sorting, and file comparison. Four of those were
          written in C, and the other four in Pascal.

[Tane78]  Tanenbaum published static and dynamic measurements of HLL
          constructs, collected from more than 300 procedures used in
          operating-system programs and written in a language that supports
          structured programming (SAL).


2.2.1  Measurements  on  Operations. 

The  operations  performed  by  programs  are  the  most  frequent  object  of 
measurements  in  the  form  of  statement  types  (source  level)  or  opcodes 
(machine  level).  The  following  table  summarizes  such  measurements. 




Property:                                        Measurem.:     Reference:
--------------------------------------------------------------------------------
Dynamically executed instructions:
  moves between registers and memory             [illegible]    [Lund77, p.149]
  branching instructions                         [illegible]    (numeric &
  fixed-point add/sub's                          [illegible]    compilers)

  load, load address                             [illegible]    [AlWo75]
    (more than normal, due to 360 architecture)                 (mostly compil.
  store                                          [illegible]    in XPL
  branch                                         [illegible]    on IBM/360)
  compare                                        [illegible]

Statically counted HLL statements:
  assignments                                    [illegible]    [AlWo75]
  if                                             [illegible]    (mostly compil.
  call                                           [illegible]    in XPL)

Dynamically executed HLL statements:
  assignments                                    42 ± 12 %      [PaSe82]
  if                                             36 ± 15 %      (non-numeric,
  call/return                                    14 ± 4 %       in C and Pascal)
  loops                                           4 ± 3 %

...weighted with the number of machine instructions executed for each:
  loops                                          37 ± 5 %       [PaSe82]
  call/return                                    32 ± 12 %      (non-numeric,
  if                                             16 ± 7 %       in C and Pascal)
  assign                                         13 ± 4 %

...weighted with the number of memory accesses necessary for each:
  call/return                                    45 ± 16 %      [PaSe82]
  loops                                          30 ± 4 %       (non-numeric,
  assign                                         15 ± 5 %       in C and Pascal)
  if                                             10 ± 4 %

More on procedure calls:
  procedure calls as percentage                  [illegible]    [Tane78] (O.S.,
    of dynamically executed HLL statements                      structured prog.)
  procedure call administration                  [illegible]    [Lund77, p.151]
    as percentage of execution time                             (BLISS compiler)
  an amazing exception case:                                    [Elsh76] (PL/I
    procedures defined within 100 K statem.      83 (only!)     business prog.,
    perc. of calls relative to all statem.       2 % (!)        static)

Other frequent high-level operations:
  •  vector operations (inner product, move, sum, search, ...) [Lund77]
  •  character-string ops (table-controlled substitute, delete, branch)
  •  loop control (increment a reg., compare it to another reg., and branch)


Property:                                        Measurem.:     Reference:
--------------------------------------------------------------------------------
Jump distance, measured dynamically:
  < 128 bytes                                    55 %           [AlWo75]
  < 16 Kbytes                                    93 %

Jump conditions, measured dynamically:
  unconditional jumps as % of all jumps          55 %           [AlWo75]
  ..."the comparison of two non-zero values                     [Lund77]
  is about twice as common as comparison with zero".

Expressions, register lives:
  one-term expressions in assignments †          66 %           [Tane78]
  two-term expressions in assignments †          20 %           (dynamic)
  operators per expression (average)             0.76           [AlWo75] (stat.)
  relative to all register lives:
    lives w. no arithm. performed on them        50% (20-90%)   [Lund77]
    lives w. max †† integer add/sub on them      25% (1-70%)    (dynamic,
    lives w. max †† integer mult/div on them      5% (2-20%)    numeric &
    lives used in floating-point operations      15% (0-40%)    compilers)
    lives used for indexing                      40% (20-70%)   [Lund77]

†  on the right-hand-side of assignments.

†† "maximum-complexity" operation performed on the register,
   where int-add/sub < int-mult/div < floating-point-op.


These measurements are not very helpful in understanding the high-level
nature of computations, but they do show:

•  The importance of the procedure call mechanism, since so much time is spent
   in it.

•  The importance of the sequencing control mechanism (compare and branch),
   since loops and if's are so frequent.

•  The importance of simple arithmetic and of addressing, accessing, and moving
   operands around, since expressions are usually very short, and since half of
   the operands appearing in registers ("register lives" in [Lund77]) have no
   arithmetic performed on them.


2.2.2  Measurements  on  Operands. 

Measurements on the operands in programs have not been so frequent in
the literature, even though this subject is very important. Lunde [Lund77]
measured on a DECsystem10 that each instruction on the average references 0.5
operands in memory and 1.4 in registers dynamically. These figures depend
highly  on  the  architecture  and  on  the  compiler,  but  they  do  illustrate, 


Property:                                        Measurem.:     Reference:
--------------------------------------------------------------------------------
Dynamic percentage of operands (HLL, [illegible]):
  integer constants                              20 ± 7 %       [PaSe82]
  scalars                                        55 ± 11 %      (non-numeric,
  array/structure                                25 ± 14 %      in C and Pascal)

  local-scalar references                        > 80 %         [PaSe82]
    as percentage of all scalar references

  global-array/structure references              > 90 %         [PaSe82]
    as percentage of all arr/str. references

Use of PDP-11 addressing modes:
  "The four most common modes are perhaps the four simplest":   [Shus78]
  register                                       32 %           (static,
  indexed (e.g. for fields of structures)        17 %           O.S.)
  immediate (constants)                          15 %
  PC-relative (direct addressing)                11 %
  all others                                     25 %
  "The four least-used modes are precisely the 4 memory indirect ones (1%)".

  "Half of the move instr. were moving something into a register"   [Shus78]
  "Half of the compare/add/subtract instructions
  had one of their operands be an immediate"

A  property  that  had  attracted  very  little  attention  in  the  past  is  the  high 
locality  of  references  to  local  scalar  variables.  The  figures  from  [PaSe82]  given 
above  show  that  over  half  of  the  accesses  to  non-constant  values  are  made  to 
local  scalars.  On  top  of  that,  references  to  arrays/structures  require  a  previous 
reference  to  their  index  or  pointer,  which  is  again  a  -  usually  local  -  scalar.  Most 
of  the  time,  the  number  of  local  scalars  per  procedure  is  small. 
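
The "over half" claim follows from the average figures quoted from [PaSe82], as the short calculation below shows. It is only a sketch: it uses the mean percentages and treats the "> 80 %" lower bound as exactly 80 %, so the true share is at least what is computed here; the variable names are illustrative.

```python
# Average dynamic operand mix, in percent of all operands ([PaSe82] figures).
constants = 20.0
scalars = 55.0
arrays_structs = 25.0

# "> 80 %" of scalar references are local; 0.80 is used as a lower bound.
local_fraction_of_scalars = 0.80

local_scalars = scalars * local_fraction_of_scalars   # share of all operands
non_constant = scalars + arrays_structs               # share of all operands
share = local_scalars / non_constant                  # local scalars among non-constants

print(f"local scalars: at least {share:.0%} of non-constant operand accesses")
```

With these averages, local scalars account for at least 44 of the 80 percentage points that are not constants, that is, about 55 % of the non-constant accesses.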

Tanenbaum [Tane78] found that 98 % of the dynamically called procedures
had fewer than 6 arguments, and that 92 % of them had fewer than 6 local scalar
variables. Similar numbers were found by Halbert and Kessler:


Procedure Activation Records: [HaKe80]
Percentage of executed procedure calls with:

                                     compiler, interpr.   other smaller programs
                                     and typesetter       (non-numeric)
--------------------------------------------------------------------------------
> 3 arguments                        0 to 7 %             0 to 5 %
> 5 arguments                        [illegible]          0 %
> 8 words of arguments & locals      1 to 20 %            0 to 6 %
> 12 words of arguments & locals     1 to 6 %             0 to 3 %


Thus, the number of words per procedure activation is not large. The
following measurements show that the number of procedure activations touched
during a reasonable time span is not large either. This establishes the
locality-of-reference property for local scalars.


Locality of Procedure Nesting Depth: [HaKe80] [TaSe83]
Percentage of executed procedure calls
which overflow from last span of nesting depths:

(assuming that the span of nesting depths has constant size, and that its
position moves by one on every over/under-flow; this corresponds to a RISC
register file with as many windows as the span size, and with no window
reserved for interrupts. See section 3.2).

                              2 compilers, interpr.,   6+1 other smaller
                              typesetter, Hanoi        programs (non-numeric)
--------------------------------------------------------------------------------
span size = 4 (4 windows)     8 to 15 %                0 to 2.5 %
span size = 8 (8 windows)     1 to 3 %                 0 to 0.2 %
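
The model stated in the note above (a span of constant size that slides by one nesting depth on every overflow or underflow) can be sketched as a small simulation. The call traces are hypothetical, and the encoding of a trace as a string of "C" (call) and "R" (return) events is an assumption made for the illustration; this is not the measurement tool used in [HaKe80] or [TaSe83].

```python
def count_overflows(trace, span_size):
    """Count the calls that overflow a sliding span of nesting depths.

    trace:     string of "C" (call) and "R" (return) events.
    span_size: number of consecutive nesting depths held at once
               (the number of register windows in the RISC analogy).
    Returns (number_of_calls, number_of_overflowing_calls).
    """
    depth = 0     # current nesting depth; the main program is depth 0
    low = 0       # lowest depth currently inside the span
    calls = overflows = 0
    for event in trace:
        if event == "C":
            depth += 1
            calls += 1
            if depth > low + span_size - 1:
                overflows += 1
                low += 1          # overflow: the span slides up by one
        elif event == "R":
            depth -= 1
            if depth < low:
                low -= 1          # underflow: the span slides down by one
        else:
            raise ValueError(f"unknown event {event!r}")
    return calls, overflows


# Hypothetical trace: descend three levels, then call and return from a
# leaf procedure ten times, then unwind.
trace = "CCC" + "CR" * 10 + "RRR"
for windows in (2, 4):
    calls, overflows = count_overflows(trace, windows)
    print(f"{windows} windows: {overflows} of {calls} calls overflow")
```

Because the nesting depth in this trace oscillates within a narrow band, only the first excursions overflow; a monotonically deepening recursion such as `"C" * 10 + "R" * 10` would instead overflow on every call beyond the span.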


2.3  Study of some Critical FORTRAN Loops (collected mostly by Knuth).


Knuth, in [Knut71], presents a study of where FORTRAN programs spend
most of their time. The programs he measured varied from text-editing to
scientific number-crunching programs. Dynamic measurements of the HLL
statements executed showed that:


•  67%  were  assignments, 

•  one  third  of  those  assignments  were  of  the  type  A=B, 

•  11%  were  IF,  9%  were  GOTO,  3%  were  DO, 

•  3%  were  CALL,  and  3%  were  RETURN, 

•  More  than  25%  of  the  execution  time  was  spent  in  I/O  formatting. 


However, what is most interesting for our study is that he gives the actual
code fragments where 17 of those programs (chosen at random) spent most of
their time. He used those fragments ("examples") to test the effectiveness of
various techniques for optimization of compiled code. We will briefly study those
same examples from our point of interest: understanding the nature of
computations, and in particular answering the questions of section 2.1.1. The 17
examples have been classified in three categories of array-numeric, array-searching,
and miscellaneous style examples. Their code (or a summary of it) is given
below in a modernized-FORTRAN format. An eighteenth example of a critical
loop, collected by the author of this dissertation, was added to the first
category. It is the main loop of a procedure that inverts a positive-definite
symmetric matrix. It was included in the study after two researchers in structural
mechanics and in fluid dynamics independently told this author that they felt


           1   B = SQRT(A) ; K = 100.*B + 1.5 ; D[i] = S[i] * T[K]
               Q = D[1] - D[N]
               do 2 i=2,M,2
           2   Q = Q + 4.*D[i] + 2.*D[i+1]

Example 9:     do 2 k=1,M
               do 2 j=1,M
               initialize...
               do 1 i=1,M
               N = j + j + (i-1)*M2; B = A[k,i]
           1   X = X + B*Z[N] ; Y = Y + B*Z[N-1]
           2   more computations...

Example 11: a Fast Fourier Transform. It computes sums and products of
           floating-point elements of two linear arrays. One array is accessed
           sequentially, and the other one with a step of N.

Example 12: a very long inner loop, with counter arithmetic, array accesses
           (many 3-dimensional arrays, some 2- and 1-dimensional), and
           floating-point multiplications and additions. There is one
           expression with 32 operators! In spite of its heavy computation
           character, this program has no more floating-point operations than
           it has simple counter and index operations.

Example 15:    do 1 j=i,N
               H[i,j] = H[i,j] + S[i]*S[j]/D1 - S[k+i]*S[k+j]/D2
           1   H[j,i] = H[i,j]


Example 17:    do 1 i=1,N
           1   A = A + B[i] + C[k+l]

Example  -  Matrix  Inversion: 

           Figure 2.3.1 shows the aforementioned critical loop of
           positive-definite symmetric matrix inversion, in an abstract
           flow-chart form.


All  these  critical  loops  are  of  the  same  style:  They  perform  floating-point 
operations  on  elements  of  arrays.  Two  almost  independent  "processes"  exist. 
First,  array  elements  are  accessed  in  a  regular  fashion,  i.e.  in  an  arithmetic 
progression  of  memory  addresses;  the  loop  control  is  related  to  the  array 
indexes,  and  does  not  depend  on  the  array  data.  The  second  "process"  is  that  of 
doing  the  actual  numerical  data  computations. 

Figure 2.3.1: Main Loop of a Procedure to Invert a Positive-Definite
Symmetric Matrix.


2.3.2 "Array-Searching" Style Examples.

Example 1: a search for the maximum of the absolute values:
               do 2 j=1,N
               t = ABS( A[i,j] ) ; if (t>s) then s=t ;
           2   continue

Example 2: a search for a match:
               do 1 j=38,53
               if (K[i]==L[j]) then goto 2
           1   continue


Example 10:    do 1 i=1,M
           1   if ( X[i-1,j] < Q and X[i,j] Q ) then more...

Example 13: a binary search:
           1   j = (i+k)/2
               if (j==0) then goto 2
               if ( X[j] == XKEY ) then goto 3
               if ( X[j] < XKEY ) then i=j else k=j
               goto 1


These  examples  are  non-numeric.  Most  of  them  access  the  array(s)  in  a 
regular  manner,  like  the  examples  in  2.3.1.  However,  the  control  of  their 
sequencing  is  dynamic  in  nature:  it  depends  on  the  actual  data  being  visited, 
rather  than  on  regularly  incremented  counters. 
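
As an illustration of this data-dependent sequencing, the following is a minimal sketch in C of a conventional binary search over a sorted integer array. It is not a transliteration of Example 13; the function name binary_search and the half-open range convention are assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of key in the sorted array x[0..n-1], or -1 if absent.
 * Which branch is taken on each iteration depends on the data x[mid],
 * not on a regularly incremented counter. */
long binary_search(const int *x, size_t n, int key)
{
    size_t lo = 0, hi = n;              /* search the range [lo, hi) */

    assert(x != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (x[mid] == key)
            return (long)mid;
        if (x[mid] < key)
            lo = mid + 1;               /* continue in the upper half */
        else
            hi = mid;                   /* continue in the lower half */
    }
    return -1;
}
```

The array is still addressed by simple index arithmetic, but the sequence of addresses visited cannot be known before the comparisons are made.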

2.3.3 "Miscellaneous" Style Examples.




Example 4: first a poor quality random-number generator is defined:
               subroutine RAND(R)
               j = i * 65539
               if (j<0) then j = j + 2147483647 + 1
               R = j; R = R * 0.4656613e-9
               i = j; k=k+1; return
           then it is called:
               do 1 k=M,20
               call RAND(R)
           1   if ( R > 0.81 ) then N[k] = 1
           Knuth comments: "...the most interesting thing here, however, is
           the effect of subroutine linkage, since the long prologue and
           epilogue significantly increase the time of the inner loop".


Example 5: this is a long inner loop that does lots of floating-point
           computations. It contains some simple arithmetic and compare & branch
           operations on integer counters, sequential addressing of two
           linear arrays, and several floating-point exponentiations,
           multiplications, and additions. The loop is badly written, with many large
           common subexpressions. There is lots of low-level parallelism
           present, mainly among the floating-point computations, but also
           between them and the integer ones.

Example 6: a subroutine S is defined:
               subroutine S(A,B,X)
               dimension A[2], B[2]
               X=0 ; Y = (B[2]-A[2])*12 + B[1] - A[1]
               if (Y<0) then goto 1
               X=Y
           1   return
           then W is defined, which is called multiple times, and which calls S:
               subroutine W(A,B,C,D,X)
               dimension A[2], B[2], C[2], D[2], U[2], V[2]
               X=0 ; call S(A,D,X) ; if (X==0) then goto 3
               call S(C,B,X) ; if (X==0) then goto 3
               rarely executed code
           3   return

Example 8:     subroutine COMPUTE ; common ....
               complex Y[10], Z[10]
               R=real(Y[n]) ; P=sin(R) ; Q=cos(R)
               S = C * 6.0 * (P/3.0 - Q*Q*P)
               T = 1.414214 * P*P*Q*C * 6.0
               U=T/2.
               V = -2.0 * C * 6.0 * (P/3.0 - Q*Q*P/2.0)
               Z[1] = (0.0,-1.0) * ( S*Y[1] + T*Y[2] )
               Z[2] = (0.0,-1.0) * ( U*Y[1] + V*Y[2] )
               return

Example 14:    do 1 i=1,N
           1   C = C/D*R; D = D-1 ; R=R+1

Example 16:    real function F(X)
               Y = X * 0.7071068
               if ( Y < 0.0 ) then goto 1
               rarely executed code
           1   F = 1.0 - 0.5 * (1.0 + ERF(-Y)) ; return


These examples help us remember that real programs are not always as
simple and straightforward as those seen in sections 2.3.1 and 2.3.2. Relative to
those simpler ones, these "miscellaneous" programs are characterized by more
numeric computations, the same number or fewer array accesses, less
index/counter arithmetic, fewer or unusual-style comparisons and branches, and,
in some cases, more procedure calls.


2.3.4  The  Nature  of  Numeric  Computations. 

The above examples give a picture of typical numeric computations, which
can be summarized as follows:

1. The absolutely predominant data structure is the array. Most of the arrays
are 1- or 2-dimensional. (Of course, the predominance of arrays over other
data structures cannot be deduced by studying FORTRAN programs, since
arrays are the only data structure allowed in that language. However, it is
known that the vast majority of numerical computations are performed to solve
engineering or other similar problems, where the array arises as the natural
data structure.)

2. In the vast majority of the cases, the array elements are accessed in regular
sequence(s). There are a few "working locations" in the array(s), and their
addresses change as arithmetic progressions. The step is quite often equal to
one element size, or, at other times, it is the column size or some other
constant.

3. A few integer scalar variables are used as loop-counters and array-indexes.
The arithmetic performed on them is simple and corresponds to the above
"regular sequence" of array accesses: increment by a constant, compare &
branch. Address computations for multi-dimensional arrays require integer
multiplication. Most of the time, it is feasible and advantageous for the
optimizing compiler (or the very sophisticated programmer) to replace those
integer counters/indexes by actual memory pointers: the address
computations are avoided in this way (see [AhUl77], p.466: Induction Variable
Elimination).

4. The numeric computations are usually floating-point operations
(multiplications and additions/subtractions being the most frequent). Several such
operations are performed, but usually not many more in number than the
integer operations on counters.

5. Low-level parallelism is present in many cases, and has two forms: (1) among
various floating-point operations, usually when long expressions are computed,
and when a series of assignment statements is executed with no
control-transfers in between; and (2) between counter/address calculations and
floating-point operations, especially when program sequencing (if's, loops)
depends on the former only. This quite common "static nature" of program
sequencing is an important characteristic of programs which perform a
certain computation on all elements of a vector or of an array.

6. The last property also gives these programs significant amounts of
higher-level parallelism. Subsequent loop iterations are independent and could
proceed in parallel. Sometimes, they are completely independent (Example
15 of section 2.3.1), so that a highly pipelined von Neumann processor could
take advantage of them. Other times, they are less independent (Example 17
in section 2.3.1 would require a tree-organized addition); von Neumann
architectures and languages typically cannot exploit that parallelism.
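
The replacement of indexes by pointers mentioned in item 3 can be illustrated with a short sketch in C. It sums one column of a row-major matrix, first with explicit index arithmetic and then with a pointer that advances by a constant step. The function names and the row-major layout are assumptions made for this illustration; the transformation is written out by hand here, whereas [AhUl77] describes it as a compiler optimization.

```c
#include <assert.h>
#include <stddef.h>

/* Index form: every access implies the multiplication i * ncols. */
double column_sum_indexed(const double *a, size_t nrows, size_t ncols, size_t j)
{
    double sum = 0.0;

    assert(j < ncols);
    for (size_t i = 0; i < nrows; i++)
        sum += a[i * ncols + j];
    return sum;
}

/* Pointer form: the index has been replaced by a memory pointer.  The loop
 * contains only "increment by a constant, compare & branch". */
double column_sum_pointer(const double *a, size_t nrows, size_t ncols, size_t j)
{
    assert(j < ncols);
    if (nrows == 0)
        return 0.0;

    const double *p = a + j;                            /* first element of column j */
    const double *last = a + (nrows - 1) * ncols + j;   /* last element of column j  */
    double sum = *p;

    while (p != last) {
        p += ncols;          /* constant step: one row length */
        sum += *p;
    }
    return sum;
}
```

The single multiplication that remains in the pointer form is executed once, outside the loop, to compute the limit against which the pointer is compared.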


2.4  A  Study  of  four  C  Programs 

for  Text  Processing  and  CAD  of  IC’s. 

In this section we study the critical loops of four non-numeric programs,
written in C and taken out of the Berkeley UNIX and CAD environment:

fgrep   the UNIX program which searches a file for occurrences of fixed
        strings,

sed     the UNIX stream (batch) text editor,

sort    the UNIX program to sort the lines in a file, and

mextra  a circuit extractor [FitzMe] which, given a description of the IC's
        geometry, generates a list of the transistors and their
        interconnections present in an integrated circuit. It works by first reading-in the
        description of the geometry and building a corresponding dynamic
        data structure, and then "scanning" the IC following horizontal
        scanlines of gradually increasing y-coordinate. It may be considered an
        example of a program that manipulates a non-trivial dynamic data
        structure.


As an argument in support of the representativeness of the above sample of
programs, let us look at a typical compiler. Kessler's Pascal compiler spends
most of its time [Kess82] scanning the input (i.e. reading and recognizing
characters), generating assembly code (i.e. character I/O), and walking through tree
structures and interrogating them. These functions are similar to what fgrep,
sed, and mextra do.

The  tools  described  in  section  2.1.3  were  used  for  locating  the  critical  loops. 
Below,  wherever  code  is  shown,  the  number  on  the  left  of  each  line  is  the  count 
of  how  many  times  the  line  was  executed  during  the  test  run. 

2.4.1 FGREP: a String Search Program.

In the test run, fgrep was used to search for occurrences of the string
"kateveni" in a file of size ≈ 230 KBytes (there were a few hundred such
occurrences). The run took about 8 seconds CPU time, allocated as follows:

• ≈ 87% in the procedure execute(),

• ≈ 11% in _read (i.e. in the operating system),

• ≈ 2% in everything else.

The procedure execute() follows:
fgrep: execute() [87%]:


         | # define ccomp(a,b) (yflag ? lca(a)==lca(b) : a==b)
         | # define lca(x) (isupper(x) ? tolower(x) : x)
         |
         | struct words {
         |     char inp, out;
         |     struct words *nst, *link, *fail;
         | } w[MAXSIZ];
         | int yflag;
         |
         | execute(file) char *file;
         | { register struct words *c;
         |   register int ccount;
         |   register char ch, *p;
         |   char buf[2*BUFSIZ];
         |   int f, failed; char *nlp;
         |
         |   .... Initial Set-Up Work ....
  229253 |   for (;;)
  229253 |   { if (--ccount <= 0)
     226 |       { read-in a new 1Kbyte block or exit loop }
         |   nstate:
  229252 |     if (ccomp(c->inp, *p))        /* in-line expansion */
     923 |       { c = c->nst; }
  228329 |     else if (c->link != 0)
       0 |       { c = c->link; goto nstate; }
         |     else
  228329 |       { c = c->fail;
  228329 |         failed = 1;
  228329 |         if (c==0)
  228329 |           { c = w;
         |           istate:
  228329 |             if (ccomp(c->inp, *p))   /* in-line exp. */
       0 |               { c = c->nst; }
  228329 |             else if (c->link != 0)
       0 |               { c=c->link; goto istate; }
         |           }
         |         else goto nstate;
         |       }
  229252 |     if (c->out)
      48 |       { Code for Success }
  229204 |     if (*p++ == '\n')
    4237 |       { Code for End-of-Line }
         |   }
         |   .... Final Wrap-Up Work ....


Figure 2.4.1 contains a flow-chart of the critical loop of this run of fgrep.
The vast majority of the operations performed are simply:

• accesses to scalars (mostly locals) and indirections through them to access
fields of structures to which they are pointing, and

• comparisons (mostly to zero) & subsequent branches.

The high frequency of compare-&-branches is in part a result of the nature of
the program (pattern matching), but is also a general characteristic of
non-numeric programs, as the next examples will show.


2.4.2  SED:  a  Batch  Text  Editor. 

In our test run, sed copies a 2.2 Mbyte file to output, searching for
occurrences of three short fixed patterns. It replaces two of them with two others
(one shorter, one longer), and upon encountering the third one, it appends a
specified new line after the current one. The run took about 160 sec CPU time,
allocated as follows:


• ≈ 23% in the procedure execute(),

• ≈ 23% in the procedure match(),

• ≈ 16% in the procedure gline(), and

• all other procedures accounted for < 8% each.


sed: execute() [23%]:

       1 | execute(file) char *file;
         | { register char *p1, *p2;
         |   register union reptr *ipc;
         |   int c; char *execp;
         |
         |   .... Initial Set-Up Work ....
   52820 |   for(;;)
   52820 |   { if((execp = gline(linebuf)) == badp) { rare }
   52819 |     spend = execp;
  158457 |     for(ipc = ptrspace; ipc->command; )
  158457 |     { p1 = ipc->ad1;
  158457 |       p2 = ipc->ad2;
  158457 |       if(p1)
   52819 |       { if(ipc->inar) { never }
   52819 |         else if(*p1 == CEND) { never }
   52819 |         else if(*p1 == CLNUM) { never }
   52819 |         else if(match(p1, 0)) { 22,000 if's executed }
   30899 |         else { 68,000 statements executed; continue; }
         |       }
  127558 |       if(ipc->negfl) { never }
  127558 |       command(ipc);
  127558 |       if(delflag) { never }
  127558 |       if(jflag) { never }
  127558 |       else ipc++;
         |     }
   52819 |     if(!nflag && !delflag)
 2143025 |     { for(p1 = linebuf; p1 < spend; p1++)
         |         /* "spend" is a global pointer */
 2143025 |         putc(*p1, stdout);
         |         /* Note: in-line expanded to: ( --_iob[1]._cnt>=0 ? *(_iob[1]._ptr)++ = *p1 : {rare} ) */
         |         /* _iob[1]._cnt and _iob[1]._ptr are global scalars (the compiler knows their address) */
   52819 |       putc('\n', stdout);
         |     }
   52819 |     if(aptr > abuf) { 22,000 calls: arout(); }
   52819 |     delflag = 0;
         |   }

Here, we have:

• 0.26 M procedure calls,

• 2.35 M compare-&-branch,

• 3.10 M test-&-branch,

• 4.40 M incrementations, and

• 0.50 M assignments with no operation (move-type).


The vast majority of operands are accessed indirectly, through local
pointers with a zero or small offset. Other accesses are to local and global
scalars. Certainly, a lot of this procedure's time is spent in the tight for loop
that copies characters to standard output.
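
The operations inside that tight loop can be made explicit with a sketch in C of a buffered character output in the style of the in-line expansion of putc. The structure obuf, its tiny buffer size, and the names obuf_init, obuf_flush_put and OBUF_PUTC are hypothetical, introduced only for this illustration; this is not the library's actual definition.

```c
#include <assert.h>
#include <stddef.h>

#define OBUF_SIZE 16                 /* tiny buffer, for illustration only */

struct obuf {
    int   cnt;                       /* free bytes left in base[]   */
    char *ptr;                       /* next free byte              */
    char  base[OBUF_SIZE];
    int   flushes;                   /* how often the slow path ran */
};

void obuf_init(struct obuf *b)
{
    assert(b != NULL);
    b->cnt = OBUF_SIZE;
    b->ptr = b->base;
    b->flushes = 0;
}

/* Slow path (the rare case): the buffer is full.  A real library would
 * write base[] out here; this sketch only counts the event and reuses
 * the buffer. */
int obuf_flush_put(struct obuf *b, int ch)
{
    b->flushes++;
    b->ptr = b->base;
    b->cnt = OBUF_SIZE - 1;          /* one slot is taken by ch below */
    *b->ptr++ = (char)ch;
    return ch;
}

/* Fast path: decrement, test-&-branch, store through a pointer, increment.
 * The argument b is evaluated more than once, so it must be side-effect free. */
#define OBUF_PUTC(ch, b) \
    (--(b)->cnt >= 0 ? (int)(*(b)->ptr++ = (char)(ch)) \
                     : obuf_flush_put((b), (ch)))
```

Each character copied on the fast path costs one decrement, one test-&-branch, one indirect store and one pointer increment, which is the operation mix counted above.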


sed: match() [23%]:

  161108 | match(expbuf, gf) char *expbuf;
         | { register char *p1, *p2, c;
  161106 |   if(gf) { Execute ≈ 150,000 statements }
  158457 |   else { p1 = linebuf; locs = 0; }
  161106 |   p2 = expbuf;
  161106 |   if(*p2++) { never }
         |   /* fast check for first character: */
  161106 |   if(*p2 == CCHR)
  161108 |   { c = p2[1];
 5242478 |     do { if(*p1 != c) continue;
  269623 |          if(advance(p1, p2)) { infrequent }
 5189445 |     } while(*p1++);
  106075 |     return(0);
         |   }
         |   ...Various others, never executed...
         | }

sed: gline() [16%]:

   52820 | char *gline(addr) char *addr;
         | { register char *p1, *p2; register c;
         |   ...Initial Set-Up Work (100,000 statements total)...
 2174691 |   for (;;)
 2174691 |   { if (p2 >= ebp) { rare }
 2174690 |     if ((c = *p2++) == '\n') { infrequent }
 2121871 |     if(c) if(p1 < lbend)
 2121871 |       *p1++ = c;
         |   }
         |   ...Final Wrap-Up Work (200,000 statements total)...


These two procedures spend most of their time scanning characters.
Match() scans characters searching for some particular one. Gline() scans
characters copying and checking them.
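
The scanning pattern of match() can be sketched in C as follows. The function find_fixed and its helper starts_with are hypothetical simplifications for fixed strings only; starts_with merely stands in for the role that advance() plays in the listing and does not reproduce sed's pattern matching.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for advance(): does `s` begin with `pat`? */
static int starts_with(const char *s, const char *pat)
{
    while (*pat != '\0') {
        if (*s++ != *pat++)
            return 0;
    }
    return 1;
}

/* Returns a pointer to the first occurrence of the non-empty string `pat`
 * in the NUL-terminated `line`, or NULL if there is none.  Most iterations
 * execute only the fast first-character test. */
const char *find_fixed(const char *line, const char *pat)
{
    assert(pat[0] != '\0');
    const char c = pat[0];
    const char *p = line;

    do {
        if (*p != c)
            continue;               /* jumps to the while test below */
        if (starts_with(p, pat))
            return p;
    } while (*p++ != '\0');
    return NULL;
}
```

As in the listing, the loop body is one compare-&-branch on a character fetched through a local pointer, followed by a test-&-branch with a pointer increment.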


2.4.3  SORT:  an  Extreme,  but  Real  Case. 

The particular sorting program that was studied, namely the one installed
on our UNIX machines, spent one third of the test run time in its calls to a trivial
procedure blank() used to scan over blanks. Obviously, blank() would be
better defined as a macro, so that it is expanded in-line. The test run
consisted of sorting a 2.2 Mbyte file, relative to the second-in-line field and with
elimination of duplicates. It took half an hour of CPU time.


sort: blank() [31%]:

26087970 | blank(c)
26087970 | { if(c==' ' || c=='\t')
 8279488 |     return(1);
19808482 |   else return(0);
         | }
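
A minimal sketch in C of the macro form suggested above is given next. The macro name BLANK and the helper skip_blanks are hypothetical; the procedure form is repeated, with modern declarations, only for comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Procedure form, as in the listing: one call and return per character. */
int blank(int c)
{
    if (c == ' ' || c == '\t')
        return 1;
    else
        return 0;
}

/* Macro form: expanded in-line, so no call/return overhead remains.
 * The argument is evaluated twice, so it must have no side effects. */
#define BLANK(c) ((c) == ' ' || (c) == '\t')

/* Scans over leading blanks using the macro (hypothetical helper). */
const char *skip_blanks(const char *p)
{
    assert(p != NULL);
    while (BLANK(*p))
        p++;
    return p;
}
```

With the macro, the scan over blanks reduces to two compare-&-branches and a pointer increment per character.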


In general, text-processing programs spend a lot of their time in inner loops
where they sequentially "walk" through the characters in buffers, copying,
comparing, or testing various things.


It is important to notice that programs dealing with text waste a lot of
memory bandwidth in the usual architectures, where a full memory word is
accessed each time a byte transaction takes place.

Exploitation of parallelism is difficult in these programs, because of the
high frequency of conditional branches. The amount of work done between two
consecutive branches is usually quite small, with limited parallelism.
Parallelism is often available between operations in two different blocks B1 and B2
separated by a conditional branch, where the branch usually follows the path
that makes B2 execute after B1. Programs are usually written in such a way
that execution of B2 cannot start before it is certain that it should start. The
programmer could rearrange the code and introduce temporary variables to
hold tentative results, but doing so would lead to complicated and
hard-to-maintain programs.
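
The rearrangement mentioned in the last sentence can be sketched in C. The two functions below are hypothetical and compute the same result; in the second one the work of block B2 is started before the branch is resolved, and its result is held in a temporary. Unsigned arithmetic is used so that the tentative computation is harmless even when its result is discarded.

```c
#include <assert.h>

/* Straightforward form: B2 starts only after the branch is resolved. */
unsigned long step_plain(unsigned long acc, unsigned long x, unsigned long limit)
{
    if (x > limit)              /* conditional branch, rarely taken */
        return acc;
    return acc + x * x;         /* block B2 */
}

/* Rearranged form: B2's work is done ahead of the test and kept in a
 * temporary; it is simply dropped when the branch goes the other way. */
unsigned long step_tentative(unsigned long acc, unsigned long x, unsigned long limit)
{
    unsigned long tentative = acc + x * x;   /* tentative result of B2 */

    if (x > limit)
        return acc;             /* tentative result is discarded */
    return tentative;
}
```

Even in this small case the second form is harder to read, and it is only legal because B2 has no side effects; with stores or possible faults in B2 the rearrangement becomes considerably more involved.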


2.4.4 MEXTRA: a Circuit Extraction Program.

Mextra's test run consisted of extracting the circuitry in the control
section of the RISC II chip. It took 330 sec CPU time, allocated as follows:


• ≈ 14% in the procedure ScanSubSwath(),

• ≈ 11% in the procedure Propagate(),

• ≈ 10% in the procedure alloc(),

• ≈ 8% in the procedure EndTrap(),

• ≈ 5% in the procedure Free(), and

• the remaining procedures took < 4% of the total time each.


mextra: ScanSubSwath() [14%]:


     771 | ScanSubSwath(bin) int bin;
         | { int i, newCount, n;
         |   register edge *new, *old, *last, *oldList, *newList;
         |
         |   ...Initial Set-Up Work (30,000 statements)...
  353237 |   while(new != NIL && old != NIL)       /* NIL is 0 */
  352488 |   { if(new->bb.l < old->bb.l) { infrequent }
         |     else
  302554 |     { if(n < old->bb.t)
  254342 |       { if(last == NIL) { rare }
  253628 |         else { last->next = old; last = old; }
  254342 |         old = oldList;
  254342 |         if(old != NIL) oldList = old->next;
         |       }
   48212 |       else { infrequent }
         |     }
  304254 |     if(depth[last->layer] == 0)
  140442 |       StartTrap(last->layer,last);
  304254 |     if((depth[last->layer] += last->dir) == 0)
  139794 |       EndTrap(last->layer,last);
  304254 |     nextEnd = (nextEnd < last->bb.t ? nextEnd : last->bb.t);
         |   }
         |   ...Final Wrap-Up Work (250,000 statements)...
         | }


This procedure performs extensive list operations, using local pointers. The
total operations performed in its critical loop are:


• 0.6 M procedure calls;

• 0.3 M additions (not counting address computations);

• 1.8 M test-&-branches;

• 1.0 M compare-&-branches;

• ≈ 0.6 M accesses to a global scalar (nextEnd);

• 6.5 M accesses to locals (96% of them to pointers)
  (these include accesses for indirecting through them);

• 3.1 M accesses to fields of structures via a local pointer, and

• 1.0 M (random) accesses to a small array depth[10].


The basic pattern of memory accesses is the list traversal, which places a
corresponding limit on locality-of-reference. However, during each loop
iteration there are 11 accesses to fields of the structures pointed to by "old->" and
by "last->". Accesses to various fields of the same structure are obviously
accesses to neighboring memory locations, since the structure nodes here have
a size of 8 words. Moreover, there are repeated accesses to the same field of the
same structure, for example ≈ 4 accesses per iteration to "last->layer".

The available parallelism is again limited by the high frequency of
conditional branches. Some parallelism can be seen between accessing a memory
location and computing the effective address for a subsequent memory access.
For example:

    if ( new->bb.l < old->bb.l )

    if ( (depth[last->layer] += last->dir) == 0 )


mextra: Propagate() [11%]:


141443 


138000 


136083 

136083 

135024 

135024 

135024 


135024 


138083 


141443 


Propagate(y,yNext) int y, yNext;
{ int layer, height, tempx, tempy;
  register segment *above, *below, *next, *poly, *diff;

  ...Initial Set-Up Work (8,000 statements)...
  for( above=Above[layer]; above!=NIL; above=above->next )
  {
    for( ; below!=NIL && below->right < above->left;
           below=below->next )
      if(below->area != 0) { rare }
    for( next=below; next!=NIL && next->left <= above->right;
           next=below->next )
    { below = next;
      if(above->node == 0)
      {
        above->node = below->node;
        above->area = below->area +
            height * ( above->right - above->left - 1 ) / 100;
        above->perim = below->perim +
            2 * (height + above->right - above->left -
                 MIN(above->right,below->right) +
                 MAX(above->left,below->left) ) / 10;
        /* Note: In-line expansions:
           MIN(x,y) into: (x<y ? x : y); MAX(x,y) into: (x<y ? y : x) */
        below->perim = below->area = 0;
      }
      else { rare }
      if(below->area != 0) { never }
    }
    if(above->node == 0) { rare }
  }
  ...Final Wrap-Up Work (500,000 statements)...


Here again, extensive list operations are performed. The list-nodes have a
size of 8 words, and are accessed via local pointers. During each loop iteration,
16 accesses are made to fields of a certain list-node, and 15 to fields of another.
Each individual field is accessed an average of 3 times. This procedure has more
numeric computations than the other procedures in this section, but these are
still not the dominant factor.

mextra: alloc() [10%]:

  283165 | alloc(n)
         | { register int tmp; register struct cell *ptr;
         |
  283165 |   if(n<CELLSIZE-4) { rare }
  283165 |   n = (n+WORDSIZE-1)/WORDSIZE;    /* WORDSIZE is 2 in this example */
  283165 |   if(TBLSIZE<=n) { rare }
  283138 |   else if(FreeTbl[n]!=0)
  258662 |   { ptr=FreeTbl[n]; FreeTbl[n]=ptr->next; --FreeCnt[n];
  258662 |     if(ptr->status!=FREE || ptr->count!=n) { never }
  258662 |     if(FreeCnt[n]!=0)
  241417 |     { if(FreeTbl[n]->status!=FREE) { never }
  241417 |       if(FreeTbl[n]->count!=n) { never }
         |     }
         |     else { rare }
         |   }
         |   else { infrequent }
  283165 |   ptr->status = ALLOC; ptr->count = n; tmp = (int) ptr;
  283165 |   if(n<TBLSIZE) AllocCnt[n]++;
  283165 |   return(tmp+4);
         | }


This last procedure has no loop: it is entered many times, and does a little
work each time. Besides accessing fields of structures via pointers, it also
makes many references to the nth elements of several arrays. These latter are
not sequential array-element accesses. However, if the information were kept in
a single array of structures, instead of in multiple simple arrays, then the above
accesses would all be to neighboring memory locations. Slightly more
parallelism can be found here, for example:

{ ptr->status=ALLOC; ptr->count=n; tmp=(int)ptr; if(n<TBLSIZE) ... }.

Also, notice that the if's that lead to then-clauses which never get executed are
consistency checks, and they could all be done in parallel if the language
allowed some way of expressing that.
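
The alternative layout suggested above, one array of structures instead of several simple arrays, can be sketched in C as follows. The array names FreeTbl, FreeCnt and AllocCnt come from the listing; the value of TBLSIZE, the structure size_class, the array Classes and the function note_alloc are assumptions made for this illustration and do not appear in mextra.

```c
#include <assert.h>
#include <stddef.h>

#define TBLSIZE 8                  /* hypothetical table size for this sketch */

struct cell;                       /* free-list node; its layout is not needed here */

/* Layout as in the listing: three parallel arrays indexed by n.
 * FreeTbl[n], FreeCnt[n] and AllocCnt[n] lie in different arrays,
 * hence in distant memory locations. */
struct cell *FreeTbl[TBLSIZE];
int FreeCnt[TBLSIZE];
int AllocCnt[TBLSIZE];

/* Alternative layout: a single array of structures.  Everything known
 * about size class n sits in neighboring words. */
struct size_class {
    struct cell *free_list;
    int free_cnt;
    int alloc_cnt;
};
struct size_class Classes[TBLSIZE];

/* Bookkeeping for one allocation from class n: one address computation,
 * then accesses at small offsets from a local pointer. */
void note_alloc(size_t n)
{
    assert(n < TBLSIZE);
    struct size_class *sc = &Classes[n];

    if (sc->free_cnt > 0)
        sc->free_cnt--;
    sc->alloc_cnt++;
}
```

With the second layout, the references to the nth elements become accesses to fields of one structure through a local pointer, the same access pattern observed in the other mextra procedures.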


The overall picture from this CAD program is one of many conditional
branches and of many accesses to fields of structures using local pointers
pointing to them. Although the application has some arithmetic that needs to be
done, it does not play a dominant role. There are very few increment
operations, contrary to the programs studied in earlier sections,
because this program deals with dynamic data structures. The locality of
references to the elements of the data structures stems from the computation
pattern of performing several accesses to various fields of a few structure
instances, before interest shifts to some new such instances.


2.5  Summary  of  Findings. 

In this chapter, we first reviewed static and dynamic program statistics
collected by other researchers. Their results indicate that the simplest operations
are also the ones that are executed most of the time.

Then, we looked at several FORTRAN programs, most of them doing
numerical computations. We observed that they perform primarily floating-point
arithmetic operations on operands which frequently are elements of arrays. The
inner loops usually traverse the arrays in a "regular" fashion, using indexes that
are incremented by a constant amount and compared to a limit. The use of
pointers rather than indexes, by the programmer or by the optimizing compiler,
would be advantageous.

Then, we studied some text-processing programs written in C, and saw that
they spend a large fraction of their time running sequentially through character
buffers. These are array elements, again, but here programmers usually access
them indirectly through local pointers. The dominant operations are no longer
arithmetic: they are tests or comparisons for branching and mere copying.

Finally, we analyzed a program for CAD of IC's, which manipulates a
non-trivial dynamic data structure. The fields of a few nodes (structures) are
accessed several times indirectly through local pointers, before the program
shifts its attention to some other nodes linked to the previous ones. Again, we
found high frequencies of test/compare-&-branch and of copying.

In  all  cases,  we  saw  that  programs  are  organized  in  procedures  and  that 
procedure  calls  are  frequent  and  costly  in  terms  of  execution  time.  Procedures 
usually  have  a  few  arguments  and  local  variables,  most  of  which  are  scalars,  and 
are  heavily  used.  The  nesting  depth  fluctuates  within  narrow  ranges  for  long 
periods  of  time. 

We found low-level parallelism, although usually in small amounts, mainly
between address and data computations. The frequent occurrence of
conditional-branch instructions greatly limits its exploitation.

General-purpose  computations,  as  usually  expressed  in  von  Neumann 
languages,  are  carried  out  by  walking  through  static  or  dynamic  data  structures 
in  some  -  usually  regular  -  path.  Operand  addressing,  copying,  and  comparing 
for  decision  making,  are  factors  of  prime  importance.  Procedures  are  heavily 
used  for  hierarchical  organizations.  Numeric  computations  are  frequent  and 
expensive  in  some  applications. 

In  the  next  chapters,  possible  architectural  features  for  exploiting  these 
program  characteristics  will  be  presented. 


CHAPTER  3: 


THE  RISC  I  &  II 
ARCHITECTURE 
AND  PIPELINE. 


In chapter 1 the complexity-speed trade-off was discussed, and the
importance of effective utilization of hardware resources was stressed. In chapter 2
we observed the predominance of operand addressing and accessing, of
comparisons, and of conditional branching in general-purpose computations.

In this chapter the architecture, the pipeline, and the basic timing of RISC I
and II are presented. Architecture and micro-architecture discussions are
intermixed because an understanding of the implementation is essential in
making architectural decisions that lead to a high-performance processor. It is
shown how the RISC I & II architecture efficiently supports integer
general-purpose computations with a reduced instruction set, allowing for compact and
fast implementation. Benchmark measurements of RISC's performance are
reviewed.

The next two chapters deal with the design, layout, debugging, and testing
of RISC II, while chapter 6 discusses possible hardware enhancements to
RISC-style processors, for increased performance. A detailed and exact description
of the RISC II architecture, for reference purposes, can be found in Appendix A.


3.1  The RISC I & II Instruction Set.


The architecture of the Berkeley RISC is register-oriented, because program
measurements indicate that supporting fast operand accesses is of utmost
importance. On the one hand, the compiler by default allocates some frequently
used program variables into registers. On the other hand, operations on
operands in memory are decomposed into their orthogonal subtasks of first
bringing the operands into registers, then performing the operation, and last
moving the result to memory. This decomposition brings no loss in performance
when proper pipelining is utilized, while it simplifies the instruction set and
its implementation.

All  RISC  instructions  have  a  fixed  width  of  one  word  for  simplicity  and 
efficiency  of  the  instruction  fetch-and-sequence  mechanism.  The  instruction 
format  is  simple,  with  fields  at  fixed  locations,  for  simple  and  fast  instruction 
decoding.  The  operations  performed  by  the  instructions  all  fit  within  the  same 
general  framework,  allowing  for  a  simple  and  fast  data-path,  for  high  utilization 
of  the  data-path  resources,  and  for  a  simple  homogeneous  timing  scheme. 

RISC  is  a  32-bit  architecture,  since  a  4  Giga-Byte  virtual-address  space  is 
believed  to  be  enough  for  the  next  several  years.  Bytes,  half-words,  and  full- 
word  integers  (32  bits)  are  supported  in  memory;  they  are  all  converted  into 
full-words  when  moved  into  registers.  This  offers  simplicity,  while  maintaining 
full  operational  flexibility  with  integers  and  characters. 

This  section  describes  and  discusses  the  RISC  instruction  set,  but  does  not 
rigorously  define  it.  Refer  to  Appendix  A  for  an  exact  definition. 


3.1.1  Register-Oriented  Organization. 


From one point of view, a computer does 70% operand accesses and 30%
operations: for each operation one or two source operands are required, and the
result is placed into another operand. When one also considers the high
frequency of A-gets-B type assignments (§ 2.2.1 [Tane78], [AlWo75]; beginning of
sect. 2.3; sect. 2.4), where there are operand references but no operation, one
realizes how important it is for a computer system to have quick access to
operands. The fastest storage device is a CPU register, not only because the
register file is physically small and on the same chip as the CPU, but also
because addressing is made with a much shorter address than for cache or
memory. For these reasons, RISC tries to keep as many of its operands as
possible in registers. It is not enough to keep only temporary unnamed results in
the registers, because expressions are usually very short (see above
references), and hence there are not too many such intermediate results. The
latter is also the reason why an expression-evaluation stack and a stack
architecture were not chosen for RISC. The Berkeley RISC architecture has many
registers, organized in multiple overlapping windows, and its compiler by
default allocates scalar arguments and local variables of procedures in them.
Multi-window register files will be discussed further in sections 3.2, 6.1, and
6.2.

In RISC I and II, operations are performed by 3-operand register-to-register
instructions:

Rd ← Rs1 op S2

Besides variables, immediate constants are also quite important in
computations. In sect. 2.2.2 we saw that they account for 15 to 20 % of the
operands used. Thus, in the above generic instruction, the second source operand
S2 can be either a register Rs2 or an immediate constant imm (see sect. 3.1.4).
One of the registers, namely R0, always contains the hardwired constant zero.
Writing into it is allowed, but will not change its value.


The  available  operations  op  are: 


•  integer addition (without or with carry),
•  integer subtraction (without or with carry),
•  integer inverse subtraction (-Rs1 + S2) (without or with carry),
•  bitwise boolean AND, OR, Exclusive-OR,
•  shift left-logical, right-logical, or right-arithmetic (all by an arbitrary
   amount).

All these instructions can optionally set the 4 existing condition codes (CC's).
The add/sub instructions assume 32-bit signed 2's-complement operands.
However, there are branch conditions that act as if the operation
(comparison) were between unsigned 32-bit quantities. The with-carry versions
of add/sub can be used for multi-word precision arithmetic. The shift
instructions will shift Rs1 by the amount (0 through 31 bit-positions) specified
in the 5 least-significant bits of S2. The logical shifts fill the emptied bit
positions with zeros, whereas the arithmetic shift-right sign-extends the
leading bit. Rotates and arithmetic shift-left are not included, because they do
not exist in HLL's. Shifts by arbitrary amounts (more than 1 or 2 bit-positions)
are not frequent in HLL's. Thus, their inclusion into our instruction set was
contrary to the RISC philosophy; section 4.2 will show the negative consequences
of this decision.
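
The behaviour of the shift and with-carry operations described above can be illustrated with a minimal Python sketch. It is only a model of the arithmetic, not of the hardware; all function names are hypothetical and chosen for this illustration.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a signed 2's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def add(rs1, s2, carry_in=0):
    """32-bit add; returns (result, carry_out)."""
    total = (rs1 & MASK32) + (s2 & MASK32) + carry_in
    return total & MASK32, total >> 32

def shift_left_logical(rs1, s2):
    return (rs1 << (s2 & 31)) & MASK32            # only the 5 LS bits of S2 count

def shift_right_logical(rs1, s2):
    return (rs1 & MASK32) >> (s2 & 31)            # emptied positions get zeros

def shift_right_arithmetic(rs1, s2):
    return (to_signed(rs1) >> (s2 & 31)) & MASK32  # leading bit is replicated

def add_double_word(a_lo, a_hi, b_lo, b_hi):
    """64-bit sum built from an add followed by an add-with-carry."""
    lo, carry = add(a_lo, b_lo)
    hi, _ = add(a_hi, b_hi, carry)
    return lo, hi
```

For example, `add_double_word(0xFFFFFFFF, 0, 1, 0)` propagates the carry of the low word into the high word, which is the use of the with-carry versions mentioned in the text.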

Several  general  and  frequent  operations,  which  do  not  appear  explicitly  in 
the  above  list,  can  readily  be  synthesized  using  the  options  available: 


Instruction:            Method of Synthesizing it:

move                    Rd ← Rs + R0
increment, decrement    use add with immediate constant of 1, -1
complement              R0 - Rs
negate (NOT)            Rs XOR "-1"
clear                   Rd ← R0 + R0
compare, test           use R0 as Rd, and set condition codes (CC's).
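
The synthesized instructions listed above can be sketched in Python as follows. This is an illustrative model only: the class `Registers`, the helper `execute`, and the wrapper names are assumptions made for this example, and S2 is simply passed in as a value.

```python
MASK32 = 0xFFFFFFFF

class Registers:
    """Register file sketch: R0 always reads as zero and ignores writes."""
    def __init__(self, count=32):
        self._values = [0] * count
    def __getitem__(self, number):
        return 0 if number == 0 else self._values[number]
    def __setitem__(self, number, value):
        if number != 0:
            self._values[number] = value & MASK32

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "xor": lambda a, b: a ^ b,
}

def execute(regs, op, rd, rs1, s2_value):
    """Rd <- Rs1 op S2, with S2 given as a value (register content or immediate)."""
    result = OPS[op](regs[rs1], s2_value) & MASK32
    regs[rd] = result
    return result

def move(regs, rd, rs):       return execute(regs, "add", rd, rs, regs[0])   # Rs + R0
def increment(regs, rd, rs):  return execute(regs, "add", rd, rs, 1)
def complement(regs, rd, rs): return execute(regs, "sub", rd, 0, regs[rs])   # R0 - Rs
def negate_not(regs, rd, rs): return execute(regs, "xor", rd, rs, -1)        # Rs XOR -1
def clear(regs, rd):          return execute(regs, "add", rd, 0, regs[0])    # R0 + R0

def is_equal(regs, rs1, rs2):
    """Compare: subtract with R0 as destination and keep only the 'zero' condition."""
    return execute(regs, "sub", 0, rs1, regs[rs2]) == 0
```

Because R0 discards writes, using it as the destination turns a subtraction into a pure comparison, which is the point of the last table entry.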

3.1.2  Memory  Accessing,  and  Addressing  Modes. 

In RISC I & II all arithmetic, logical, and shift instructions operate on
registers. Only the load and store instructions can access operands in memory
and move them to/from registers. This simplifies the processor's data-path and
control, the instruction format, and the handling of interrupts caused by
demand-paging. Related performance issues are discussed in § 3.3.2 and 3.3.3.

Load  and  store  instructions  have  a  single  addressing  mode: 

Rd ↔ M[Rs1 + S2]

The  result  of  a  RISC  add  instruction  is  used  as  effective  address  for  a  memory 
access.  This  single  addressing  mode,  which  matches  well  with  the  rest  of  RISC's 
instructions,  is  quite  versatile  and  permits  one  to  synthesize  many  other 
addressing  modes: 


Mode:                HLL usage:                    Synthesizing it in RISC:

Absolute or direct   global scalar                 M[Rb + imm]
                                                   (within ±4 Kbytes of base Rb)
Register indirect    pointer deref. (*p)           M[Rp + R0]
Indexed              field of struc. (p->field)    M[Rp + field_offs]
Indexed              linear byte array (a[i])      M[Ra + Ri]
                                                   [assume Ra points to the base of a]


Notice that the last mode can only be applied to byte arrays and not to arrays of
half- or full-words, because no scaling by 2 or 4 is done on S2. The lack of such
scaling also reduces the range of addresses accessible with the 13-bit S2
immediate offset. The reason for this lack is that there is no circuit in RISC I
& II for both shifting and adding in one instruction. The modification proposed
in section 4.3 would amend this situation.
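
As a rough illustration of this single addressing mode, the Python sketch below computes effective addresses for the cases in the table. The helper names are hypothetical; the explicit shift in `word_element_address` stands for the separate shift instruction that is needed because S2 is not scaled.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(field, bits):
    """Sign-extend the low `bits` bits of `field` (2's complement)."""
    sign = 1 << (bits - 1)
    field &= (1 << bits) - 1
    return (field ^ sign) - sign

def effective_address(rs1_value, s2_value):
    """The one addressing mode: the 32-bit sum Rs1 + S2."""
    return (rs1_value + s2_value) & MASK32

def byte_element_address(base, index):
    # a[i] for a byte array: a single load/store with S2 = Ri suffices
    return effective_address(base, index)

def word_element_address(base, index):
    # a[i] for a full-word array: the index must first be scaled by 4
    scaled = (index << 2) & MASK32
    return effective_address(base, scaled)
```

With a 13-bit signed immediate, `sign_extend(field, 13)` ranges from -4096 to 4095, which corresponds to the ±4 Kbyte reach noted for the absolute mode.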

The RISC II implementation has one notable exception from the above uniform
addressing scheme: the second source S2 must be an immediate constant
for store instructions; thus, the last addressing mode in the table cannot be
synthesized for store instructions. The reason has to do with implementation.
The register file has two ports because all instructions read two source
operands from it. A store instruction with a register S2, however, would need to
read three registers: Rd, Rs1, and S2. This could not be accommodated without
major penalties for the data-path, nor could Rd be read conveniently at a
later time. Since that addressing mode is not important enough to justify such
penalties, the feature was left out, in accordance with the RISC concept.

Memory addresses in RISC are byte addresses. Half-word quantities are
aligned on half-word boundaries, and full-word quantities on full-word
boundaries. Half-words and bytes are always right-adjusted when they are in
registers. The load and store instructions have different versions for
full-word, half-word, and byte transfers. These versions perform the necessary
change in alignment between memory and registers. The store instruction assumes
that the memory system is capable of selectively writing into some of the 4
bytes of a word. Different versions of the load instruction exist for bringing
signed or unsigned short quantities into registers with sign-extension or
zero-filling.
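
The effect of the different load and store versions can be sketched in Python as below. This is a minimal model, not the actual memory interface: the function names are hypothetical, and the byte order (`BYTE_ORDER = "big"`) is an assumption of this example, since the text above does not specify it.

```python
BYTE_ORDER = "big"   # assumption for illustration only
MASK32 = 0xFFFFFFFF

def load(memory, address, size, signed):
    """Load 1, 2, or 4 bytes and return the 32-bit register image.

    Short quantities arrive right-adjusted; they are sign-extended when
    `signed` is true and zero-filled otherwise.
    """
    if address % size:
        raise ValueError("misaligned access")
    raw = memory[address:address + size]
    return int.from_bytes(raw, BYTE_ORDER, signed=signed) & MASK32

def store(memory, address, size, register_value):
    """Write only the `size` low-order bytes of the register into memory."""
    if address % size:
        raise ValueError("misaligned access")
    low_part = register_value & ((1 << (8 * size)) - 1)
    memory[address:address + size] = low_part.to_bytes(size, BYTE_ORDER)
```

Storing a half-word leaves the other two bytes of the word untouched, which models the selective-write capability that the store instruction expects from the memory system.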

The  versatile  addressing  mode  of  the  memory  access  instructions  is  also 
used  for  the  control-transfer  instructions  (jump,  call,  return)  in  RISC  I  &  II. 
However,  because  the  Program  Counter  (PC)  is  not  in  the  register  file,  a 
separate  PC-relative  addressing  mode  was  added  for  control  transfers: 

effective_address = PC + imm

Once  that  mode  existed  for  jump/call/return  instructions,  it  required  a  trivial 
hardware  extension  to  use  it  for  load  and  store  instructions  as  well.  This  was 
done  to  allow  the  generation  of  relocatable  code  for  separately  compiled 
modules.  Global  data  can  be  allocated  next  to  the  code,  and  referenced  relative 
to  the  PC. 
However, later this turned out to be a bad idea. One wants to keep code
and data separate for allowing shared read-only code-segments, and for being
able to have separate instruction and data caches (sect. 6.3, 6.4). PC-relative
data accesses also preclude the remote-PC scheme (sect. 6.3). Finally, the use
of a linkage editor does not pose any serious problems when the code is not
relocatable.

3.1.3  Delayed  Control  Transfer. 

The Berkeley RISC architecture was designed with pipelining in mind from
the very beginning. In particular, overlapping of instruction fetch and execution
was assumed. This pipeline is disrupted by control transfer instructions, such as
conditional and unconditional jumps, procedure calls and returns. Figure 3.1.1
shows a control-transfer instruction I1 being fetched and then executed. During
its execution, it computes the address of its potential target and evaluates the
condition for conditional branching. Simultaneously, the subsequent instruction
I2 is being fetched. When execution of I1 has completed, the fetching of its
target I3 can start.


Figure  3.1.1:  Delayed-Branch  Scheme. 


Instead of flushing I2 from the pipeline, and thus wasting one cycle, RISC
employs the delayed-branch scheme. In that scheme, the transfer of control to
I3 takes effect with a delay of one instruction. I2 is executed regardless of
whether the control-transfer is successful or unsuccessful. Thus, an instruction
immediately after a jump/call/return effectively belongs to the block preceding
the transfer-instruction. The compiler puts a no-op at that place, while the
optimizer tries to move a suitable task to that place. This can be done when the
transfer-instruction does not depend on that task.
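
The delayed-branch semantics can be illustrated with a very small interpreter, sketched here in Python. The instruction tuples (`"addi"`, `"jump"`) and the function `run` are hypothetical and exist only for this example; the point is that the instruction following a jump is always executed before the target.

```python
def run(program, max_steps=100):
    """Execute a list of instruction tuples with a one-instruction branch delay.

    Returns the final register contents and the trace of executed addresses.
    """
    regs = {}
    trace = []
    pc, next_pc = 0, 1                       # next_pc is already being fetched
    while 0 <= pc < len(program) and len(trace) < max_steps:
        op, *args = program[pc]
        trace.append(pc)
        following = next_pc + 1              # default: sequential flow
        if op == "jump":
            following = args[0]              # takes effect one instruction late
        elif op == "addi":
            rd, rs, imm = args
            regs[rd] = regs.get(rs, 0) + imm
        pc, next_pc = next_pc, following
    return regs, trace

program = [
    ("addi", "r1", "r0", 1),    # 0
    ("jump", 4),                # 1: control transfer
    ("addi", "r2", "r0", 10),   # 2: delay slot, always executed
    ("addi", "r3", "r0", 99),   # 3: skipped
    ("addi", "r4", "r0", 7),    # 4: jump target
]
registers, executed = run(program)
```

In this sketch the instruction at address 2 plays the role of the useful task that the optimizer moves behind the jump in place of a no-op.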

Measurements have shown [Camp80] that the optimizer is able to remove
about 90% of the no-ops following unconditional transfers and 40% to 60% of
those following conditional branches. The unconditional and conditional
transfer-instructions each represent approximately 10% of all executed
instructions (20% total). Thus, while a conventional pipeline would lose ≈ 20%
of the cycles, optimized RISC code only loses about 6% of them. These rough
calculations assume that most RISC instructions execute in one cycle (which is
not far from true). The above figure agrees with a similar figure given by John
Cocke of the IBM Watson Research Center during an informal discussion [Cock83];
the two-cycle branches executed by optimized 801 programs are about 6% of all
executed instructions.
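
The rough calculation behind these figures can be written out as a short Python sketch. The function name is hypothetical, and the inputs are simply the rounded percentages quoted above.

```python
def lost_cycle_fraction(uncond_freq, cond_freq, uncond_filled, cond_filled):
    """Fraction of cycles spent on no-ops left behind transfer instructions.

    Assumes one cycle per instruction; `*_filled` is the share of delay
    slots that the optimizer manages to fill with useful work.
    """
    return uncond_freq * (1 - uncond_filled) + cond_freq * (1 - cond_filled)

# Conventional pipeline: every transfer instruction wastes one cycle.
flush_loss = 0.10 + 0.10

# Delayed branch with 90% of unconditional and 40% to 60% of conditional
# slots filled by the optimizer.
best_case = lost_cycle_fraction(0.10, 0.10, 0.90, 0.60)
worst_case = lost_cycle_fraction(0.10, 0.10, 0.90, 0.40)
```

With these inputs the sketch gives a loss between roughly 5% and 7% of the cycles, compared with about 20% when the instruction after every transfer is flushed.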

For the small fraction of the cycles that the optimizer cannot utilize, an
actual no-op instruction has to be inserted at the place of I2, consuming some
code space. The current RISC I & II architectures have no special versions of the
transfer-instructions that automatically suspend execution during the next
cycle. This choice was made for simplicity. In retrospect, we could have
reduced code size by 6% by adding the suspension capability with minimal
penalty. The area penalty would be about 0.1% for a circuit that flushes I2 from
the input of the opcode-decoder and replaces it with a no-op. There would be no
time penalty because that decoder is still small enough so that it does not
affect the critical timing path (§ 4.3.3).

Control-transfer instructions use the same addressing modes as loads and
stores. PC-relative is the preferred mode for jumps within a procedure, while
register-indexed jumps can be used for table-driven case statements. Call
instructions save the value of the PC into a register. The return instruction is
register-indexed only; it uses the contents of the register where the
corresponding call had saved the PC. In RISC II the return instruction is a
conditional one, just like jumps. The usefulness of such instructions is very
limited, but their implementation resulted quite naturally. Later, it turned out
that conditional returns interfere with the critical path of interrupt assertion
because an overflow trap should not occur on an unsuccessful return. This is
another case where deviation from the RISC concept led to implementation
penalties.

3.1.4  Fixed  Instruction  Format. 

An important contribution to processor complexity comes from instruction
decoding, and, in particular, from the task of extracting the various instruction
fields. RISC I & II have a simple instruction format, with fixed field positions.
This led to a very simple and fast decoding circuit (§ 4.3).

All RISC instructions are full-words of 32 bits. This greatly simplifies the
instruction fetch and decoding task. Figure 3.1.2 shows the two instruction
formats employed. Thirty-two registers are visible to the compiler at any one
time, and thus a 5-bit field is necessary for specifying the sources and the
destination. There is space for 128 opcodes, although only 39 of them are
currently used. One bit (SCC) in every instruction can specify the optional
setting of the condition codes according to the result (Rd) of the instruction.

Figure 3.1.2: Instruction Formats. (a) Short-Immediate Format; (b) Long-Immediate Format (opcode, SCC-bit, DEST, imm19).


The DEST field may specify one of two things in both instruction formats,
according to the opcode. For conditional control-transfer instructions, its 4
least-significant bits specify the branch-condition (its MS bit is unused). For all
other instructions, the DEST field specifies the Rd register number.

The short-immediate format is used for all register-to-register instructions
and for register-indexed load, store, and control-transfer instructions. The
shortSOURCE2 field consists of the 14 bits that are left over after the assignment
of the other fields. Its leading bit specifies whether it should be interpreted as
Rs2 or as an immediate constant. In the former case, 8 of the 14 bits in the field
are discarded (wasted). In the latter case, a 13-bit signed 2's-complement
immediate constant is assumed.
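
Because the fields sit at fixed positions, decoding the short-immediate format reduces to a few shifts and masks, as the Python sketch below suggests. The field widths follow the text (7-bit opcode, SCC bit, two 5-bit register fields, 14-bit shortSOURCE2); the exact bit positions, the placement of the register number in the low 5 bits, and the names `decode_short` and `encode_short` are assumptions of this example, and Appendix A remains the exact definition.

```python
from typing import NamedTuple, Optional

class ShortFormat(NamedTuple):
    opcode: int
    scc: int
    dest: int
    rs1: int
    rs2: Optional[int]   # register number, when S2 is a register
    imm: Optional[int]   # signed 13-bit constant, when S2 is an immediate

def decode_short(word):
    """Split a 32-bit instruction word into its fixed-position fields."""
    opcode = (word >> 25) & 0x7F        # assumed bits 31..25
    scc = (word >> 24) & 0x1            # assumed bit 24
    dest = (word >> 19) & 0x1F          # assumed bits 23..19
    rs1 = (word >> 14) & 0x1F           # assumed bits 18..14
    is_immediate = (word >> 13) & 0x1   # leading bit of shortSOURCE2
    low13 = word & 0x1FFF
    if is_immediate:
        imm = (low13 ^ 0x1000) - 0x1000   # sign-extend 13 bits
        return ShortFormat(opcode, scc, dest, rs1, None, imm)
    return ShortFormat(opcode, scc, dest, rs1, low13 & 0x1F, None)

def encode_short(opcode, scc, dest, rs1, rs2=None, imm=None):
    """Inverse of decode_short, used here only to build test words."""
    source2 = (1 << 13) | (imm & 0x1FFF) if imm is not None else (rs2 & 0x1F)
    return (opcode << 25) | (scc << 24) | (dest << 19) | (rs1 << 14) | source2
```

The opcode values used with this sketch are arbitrary; no real RISC II opcode assignment is implied.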

The long-immediate format is used for all PC-relative instructions. Since
the PC is the first source for them, these instructions need no Rs1 and can have
a wider immediate constant.
This format is also used for the load-high instruction, which takes a 19-bit
immediate and places it into the 19 high-order bits of Rd, simultaneously
zeroing the 13 low-order bits. Load-high can be used, in conjunction with the
13-bit immediate field of the following instruction, for loading any arbitrary
32-bit constant into a register. This method of introducing arbitrary constants
into registers requires 64 bits of code space and 2 execution cycles. It was
preferred over the more complex alternative relying on a single, longer
instruction format that could hold the whole 32-bit constant. Since the memory
bus is only 32 bits wide, two cycles would still be required for fetching such an
instruction; thus there would be no performance gain. The size of that
instruction would have to be 64 bits for proper alignment in memory, and no gains
in code size would result either. Finally, a PC-relative load instruction can be
used for the same purpose, but that means that parts of a code memory segment
are read as data (see end of § 3.1.2).
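
One way the two-instruction sequence could produce an arbitrary constant is sketched below in Python. The sketch assumes that the second instruction is an add with the sign-extended 13-bit immediate, so the high part is adjusted when the low part is negative; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def split_constant(value):
    """Split a 32-bit constant into (imm19, signed imm13) for load-high + add."""
    value &= MASK32
    low = value & 0x1FFF
    if low >= 0x1000:            # would be read as a negative 13-bit immediate
        low -= 0x2000
    high19 = ((value - low) >> 13) & 0x7FFFF
    return high19, low

def load_high(imm19):
    """Place imm19 in the 19 high-order bits and zero the 13 low-order bits."""
    return (imm19 & 0x7FFFF) << 13

def add_immediate(register_value, imm13):
    return (register_value + imm13) & MASK32

def build_constant(value):
    high19, low13 = split_constant(value)
    return add_immediate(load_high(high19), low13)
```

For instance, `build_constant(0xDEADBEEF)` reproduces the constant with one load-high and one add, i.e. the 64 bits of code and 2 cycles mentioned above.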

3.1.5  Lack  of  String,  Multiply,  Floating-Point  Support. 

RISC I & II have no support for character-string operations, integer
multiplications or divisions, or any kind of floating-point operations. There are
various reasons for this.

Hardware support of some of these functions requires considerable silicon
area, for instance a parallel multiplier, a floating-point unit, or support for more
sophisticated string operations. If such a unit were included in the central
data-path, the basic cycle time would be severely lengthened, due to increased
size and capacitance. In § 4.2.4 we will see how even the moderately-sized
shifter slows down the basic machine cycle. Another alternative is to place
these units on the CPU chip, but outside the central data-path. This would make
access to them slower than access to the integer adder, but it wouldn't
appreciably slow down the other operations.


A third alternative is to place special hardware off the CPU chip, for
instance as a co-processor. Due to chip area limitations, the latter is the most
attractive solution for today's technology. According to the views expressed in
chapter 6, the integration of functional units such as an instruction cache on the
CPU chip is very desirable and has higher priority than the integration of a large
arithmetic unit. That means that the co-processor solution will remain
attractive during the next few years as well. Co-processor architecture and
interfacing is a large and important research area, which could not be
undertaken as part of the Berkeley RISC project. For these reasons, RISC I & II
have no parallel multiplier or floating-point hardware. Partial support for
integer multiplication in the form of one step of Booth's algorithm is a feasible
and attractive solution. The main reason why this was not included in RISC I & II
is the lack of man-power for their design.

The situation for character string operations is different. These have not
yet been standardized in the High-Level-Languages themselves. Most C
programs perform them at the lowest character-by-character level (see for
example the procedure gline in sect. 2.4.2). Pascal does not even have
variable-size strings - it merely has fixed size character arrays. Under these
circumstances, it obviously makes no sense for the hardware to support something
that the HLL itself does not support. This is, nevertheless, a very interesting
area for future research and standardization. It is wasteful, in terms of memory
bandwidth, to deal with strings at the character level. One could decide to
align all strings on word boundaries, and mark their end by null-byte padding in
the last word (one or more null bytes to fill the word). Co-processors are well
suited for supporting string operations as well, since strings are most likely
to be kept in memory, and a co-processor hanging off the memory bus can process
them as they go by.

3.2  The RISC I & II Register File with Multiple Overlapping Fixed-Size Windows.

The importance of fast operand accesses and the desirability of keeping as
many of them as possible in CPU registers were presented at the beginning of
§ 3.1. Registers are few in number, and instructions address them directly by
their name. For both of these reasons they can only be used to hold scalar
variables. One way of deciding which scalars to keep in registers is to rely on
hints from the programmer, as are available in the language C. For global
variables, this is probably the only way the compiler can know which ones to
allocate in registers since programs usually have more global variables than the
machine has registers.

But the situation is different for local variables. The measurements of
Tanenbaum, and of Halbert and Kessler given in § 2.2.2 show that, for more than
95% of the dynamically called procedures, 12 words of storage are enough for all
their arguments and locals. Thus, it is feasible for an architecture to have
enough registers so that the compiler can allocate local scalars into registers by
default. In case not all of them fit, the compiler will simply place the remaining
variables in memory. The decision is not critical since the latter cases are so
rare. The measurements from [PaSe82] reported in § 2.2.2 showed that out of
100 HLL operand references, about 20 were to constants, 55 to scalars, and 25 to
arrays/structures. We can exclude constants from our count, since they are
accessed as part of the instructions themselves. We can also safely assume that
for every HLL array/structure reference there is also at least one access to a
scalar at the machine level: the array index or pointer to structure. Thus, a
more representative ratio may be: 55+25=80 scalar accesses versus 25
non-scalar ones. Data from [PaSe82] and from our own measurements reported in
§ 2.4 indicate that about 80% of the scalar references are to local scalars.
Thus, about 60% of all accesses are made to local scalars and about 40% of them
access all other kinds of objects. Since so few words are accessed with such a
high frequency and with direct addressing, allocating them into registers is the
obvious way of providing fast access to them.
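
The arithmetic of this estimate can be retraced with a few lines of Python. The variable names are ours, and the inputs are the rounded counts quoted in the paragraph above.

```python
# Out of 100 HLL operand references (constants are excluded from the count):
hll_scalars, hll_non_scalars = 55, 25

# Each array/structure reference implies at least one extra scalar access
# (the index or the pointer) at the machine level.
machine_scalars = hll_scalars + hll_non_scalars           # 80
machine_total = machine_scalars + hll_non_scalars         # 105

# About 80% of the scalar references are to local scalars.
local_scalars = 0.80 * machine_scalars                    # 64

share_local = local_scalars / machine_total               # roughly 0.61
share_other = 1 - share_local                             # roughly 0.39
```

Under these assumptions local scalars account for roughly 60% of the accesses and everything else for roughly 40%.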

The problem with keeping locals in registers is that they have to be saved
on every procedure call and restored from memory on every return. This is the
main source of the very high cost of procedure calls in terms of execution time
(25 to 40 %, § 2.2.1: [PaSe82] [Lund77]). Argument passing is the second main
source of cost. But, while procedure calls occur frequently, roughly once every
8 HLL statements (§ 2.2.1: [AlWo75] [PaSe82] [Tane78]), strong variations of their
dynamic nesting depth are rarely observed. This locality of the nesting-depth
means that, if sufficient register storage is provided for a few activation
records, instead of only for one, then register saving and restoring can be
reduced dramatically. This led Halbert and Kessler to propose a large register
file with multiple overlapping windows for the RISC architecture [HaKe80].
Previous proposals on this subject had been made by R. Sites [Site79], and F.
Baskett [Bask78], although their schemes differed from the one used in RISC I &
II. Multiple register windows had appeared in processors before RISC, but they
were usually intended for multiple processes rather than for procedures, or they
had no overlap. More measurements and studies on multiple windows can also
be found in [DiML82] and [TaSe83]. The latter paper studies the problem of
optimally managing the RISC register file.

3.2.1  Overlapping, Fixed-Size Windows.

Figure 3.2.1 (a) shows the organization of a register file into fixed-size,
overlapping windows. Not all CPU registers are simultaneously visible by the
machine language programmer at any given time. The ones that are visible are
called "the current window". The window-number inputs to the decoder select
the current window. They are supplied by the CPU state. The register-number
inputs to the decoder are supplied by the instruction, and they select one
register within the current window. Some registers belong to two windows but
have different numbers in each one of them; they are called "overlap registers".
Other registers belong to a single window and are called "locals". The scheme
works regardless of the numbering sequence in each window, as long as all
windows have the same sequence. Figure 3.2.1 (a) shows a small register file with
two overlapping windows. In addition, RISC I & II also have some registers -
called "global" and not shown in fig. 3.2.1 - which belong to all windows and have
the same number in each.

The  window  number  changes  every  time  a  procedure  call  is  executed.  Thus, 
every  procedure  activation  record  corresponds  to  a  different  window  (overflows 
are  dealt  with  in  the  next  sub-section).  The  compiler  allocates  the  local  scalar 
variables  of  procedures  into  the  "local"  registers,  so  that  no  other  activation 
record (window) has access to them. Thus, saving and restoring the registers on
calls and returns is not necessary. Local non-scalar variables, as well as scalar
ones  for  which  there  are  no  registers  available,  are  allocated  on  the  execution 
stack  in  main  memory,  as  usual. 

The windows are organized in a stack. Parent and child procedure pairs are
thus given adjacent, i.e. overlapping, windows. The compiler allocates the scalar
arguments of procedures into the "overlap" registers. These registers appear
with one fixed numbering to all parent procedures ("outgoing-argument"
registers), and with another fixed numbering to all child procedures
("incoming-argument" registers). In preparation for a procedure call, the
parent writes the actual arguments into the former registers of the "current"
window, and the child has them available in the latter registers of its own
window. Thus, the overlap of windows allows for arguments to be passed in
registers. These same "overlap" registers are also used for saving the
return-PC, and for returning values from child to parent procedure.

Figure 3.2.1: Overlapping Fixed-Size Windows.


In this scheme windows and overlaps have a fixed size, which allows the
simple and fast AND-OR decoding shown in figure 3.2.1 (a) (see sect. 6.2 for a
discussion of this point). In RISC I the overlap sections contain 4 registers, the
local sections contain 6 registers, and there are 18 global registers. In RISC II
overlaps have 6 registers, locals have 10 registers, and there are 10 global
registers. The register numbering for RISC II can be found in Appendix A. That
numbering is such that overlap registers appear in their two windows with 5-bit
numbers that differ in only one bit position. The special NMOS decoder of figure
3.2.1 (b) is then possible, which is significantly faster than the general
OR-AND-INVERT decoder. This observation was made after RISC II was submitted for
fabrication, so the circuit was not used in the actual chip.
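
The mapping from a (window, register-number) pair to a physical register can be sketched in Python as follows, using the RISC II sizes given above (10 globals, 6 overlap registers, 10 locals). The particular numbering chosen here (globals first, then outgoing arguments, locals, incoming arguments), the window count of 8, and the function name are assumptions of this illustration; the actual numbering is the one defined in Appendix A.

```python
N_GLOBAL, N_OVERLAP, N_LOCAL = 10, 6, 10
N_WINDOWS = 8                              # assumed for illustration
WINDOW_STRIDE = N_OVERLAP + N_LOCAL        # registers owned by each window

def physical_register(window, reg):
    """Map register `reg` (0..31) of window `window` to a physical index."""
    if not 0 <= reg < N_GLOBAL + 2 * N_OVERLAP + N_LOCAL:
        raise ValueError("register number out of range")
    if reg < N_GLOBAL:                               # globals: same in all windows
        return reg
    if reg < N_GLOBAL + N_OVERLAP:                   # outgoing arguments:
        owner = (window + 1) % N_WINDOWS             # the child's incoming ones
        offset = reg - N_GLOBAL
    elif reg < N_GLOBAL + N_OVERLAP + N_LOCAL:       # locals of this window
        owner = window
        offset = N_OVERLAP + (reg - N_GLOBAL - N_OVERLAP)
    else:                                            # incoming arguments
        owner = window
        offset = reg - (N_GLOBAL + N_OVERLAP + N_LOCAL)
    return N_GLOBAL + owner * WINDOW_STRIDE + offset
```

In this sketch a parent writing its first outgoing-argument register and the child reading its first incoming-argument register reach the same physical register, while the locals of the two windows never coincide.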

3.2.2  Circular-Buffer  Organization. 

The  absolute  procedure  nesting  depth  is  virtually  unbounded.  The  number 
of  register  windows  physically  present  in  a  CPU  chip  must  be  not  only  bounded 
but  also  quite  small.  Locality  of  nesting-depth  refers  to  the  relative  depth 
changes  of  procedure  nesting  during  a  limited  time  interval,  and  implies  that  its 
fluctuations  around  a  certain  depth  are  fairly  small.  The  CPU  register  windows 
are  used  for  the  few  most  recent  activation  records,  for  the  top  of  the  nesting 
stack. Older activation records may have to be saved in memory, when the
nesting depth increases, and the windows which they occupy need to be re-used for
younger procedures. Later on, as the depth decreases, these records have to be
restored into the register file windows. The actual organization of the CPU
windows is not an infinite stack, but rather a circular buffer for the top of that
stack  only.  The  rest  of  the  stack  is  maintained  in  memory. 

Figure 3.2.2 illustrates the circular-buffer organization. Two pointers are
used to keep track of empty and occupied windows. The Current-Window-Pointer
(CWP) points to the window of the currently active procedure ("window-number"
in fig. 3.2.1). The Saved-Window-Pointer (SWP) identifies the youngest window
that has been saved in memory. In the example of figure 3.2.2 a register file
consisting of 6 windows is shown, with four of them being currently occupied.
For grouping and identification purposes, the overlap registers are shown as
belonging to that window in which they constitute input arguments. This
grouping is important for the discussions that follow. The variables that are kept in
overlap registers are visible only to the child procedure in the High-Level-
Language. Only the child has a name for them and can reference them in
statements (in languages with up-level addressing, all further descendants can
also do so). In contrast, these items are not even variables in the parent
procedure; they are merely expression values, and no other statements in that
procedure can reference them. This discussion holds true for arguments passed by
value or value-return; for arguments passed by reference, the overlap registers
actually contain pointers to the arguments.

Figure 3.2.2:

If procedure D in figure 3.2.2 wants to call a procedure E, it writes the
arguments of E in its outgoing-argument registers (in the overlap of w.3 with w.2),
and then it executes a call instruction. Call instructions move CWP by one
window in one direction (decrement, modulo 6, in our example), while return
instructions move it in the opposite direction. If procedure E then decides to
further call another procedure F, that call cannot proceed with the current
status of window occupancy. The reason is that F could not write into its
outgoing-argument area without destroying the input-arguments A.in of A.
Furthermore, some registers must be kept free at all times for use by the
interrupt-handler if an interrupt occurs; these are the locals of w.1 in our case.

At this point, when procedure E executes a call instruction, we say that a
register-file overflow has occurred. A trap is generated, stopping the call
instruction from completing execution. The criterion for the generation of this
overflow trap is: when a call instruction attempts to modify CWP so that it
becomes equal to SWP. The trap gives control to the overflow-handler routine,
which saves one or more windows in memory. Tamir and Séquin have concluded
that the best strategy, for most practical cases, is to save only one window per
overflow trap [TaSe83]. In our example, the overflow-handler will save the areas
marked A.in and A.loc in memory, i.e. only part of w.1, and will then
appropriately move SWP to the start of B.in.

Similar considerations lead to the criterion for generating the underflow
trap: when a return instruction attempts to modify CWP so that it becomes
equal to SWP. Thus, a single equality comparator circuit is enough for detecting
both overflows and underflows.

In summary, an N-window register file can hold only N-1 activation
records. (In the last table of § 2.2.2, figures were given for N-1.) Interrupts
always modify the CWP in the same way as calls do. So, interrupt-handlers
execute in a window where the local registers are guaranteed to be free. Interrupts
should not be allowed to nest before the availability of more windows has been
checked.
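
The overflow and underflow criteria can be illustrated with a toy model of the circular buffer. This is a sketch under assumptions, not the RISC trap handlers: the class WindowFile and its fields are hypothetical, SWP is modeled as the slot just beyond the oldest resident window, and the handler is assumed to save or restore exactly one window per trap.

```python
class WindowFile:
    """Toy model of an N-window circular buffer with CWP and SWP."""

    def __init__(self, n_windows):
        self.n = n_windows
        self.cwp = 0          # window of the running procedure
        self.swp = 1          # slot just beyond the oldest resident window
        self.saved = 0        # number of frames currently spilled to memory
        self.overflows = 0
        self.underflows = 0

    def resident(self):
        """Number of activation records currently held in registers."""
        return (self.swp - self.cwp) % self.n

    def call(self):
        target = (self.cwp - 1) % self.n
        if target == self.swp:            # overflow: CWP would become equal to SWP
            self.overflows += 1
            self.saved += 1               # handler spills the oldest resident frame
            self.swp = (self.swp - 1) % self.n
        self.cwp = target

    def ret(self):
        target = (self.cwp + 1) % self.n
        if target == self.swp:            # underflow: CWP would become equal to SWP
            if self.saved == 0:
                raise RuntimeError("return from the outermost procedure")
            self.underflows += 1
            self.saved -= 1               # handler restores the youngest saved frame
            self.swp = (self.swp + 1) % self.n
        self.cwp = target
```

With WindowFile(6), four calls from the outermost procedure fill the file with N-1 = 5 activation records and cause no trap; the fifth call is the first to overflow.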

3.2.3  Pointers  to  Registers. 

There  are  cases  when  a  procedure’s  arguments  or  local  scalars  need  to  be 
accessed  by  a  pointer  or  by  one  of  its  descendant  procedures.  The  former  is 
true  in  languages  like  C,  where  the  programmer  is  allowed  to  ask  for  the  address 
of  a  scalar  variable  and  to  use  that  address  subsequently  as  a  pointer  for 
referencing the variable. That is, for example, the method for passing return
arguments to procedures in C: scanf("%d %d\n", &i, &j). The latter case,
references to locals by descendant procedures, appears in languages with up-level
addressing,  like  Pascal.  It  is  usually  implemented  by  maintaining  a  display  of 
pointers  to  the  bases  of  the  activation  records  of  the  (static)  ancestors.  Thus, 
this  amounts  again  to  accessing  a  local  variable  via  a  pointer. 

In  this  context,  there  are  two  methods  to  allocate  local  scalars  to  registers. 
The  first  one  applies  only  to  languages  without  up-level  addressing.  A  two-pass 
compiler is used to recognize the variables which may have aliases and to
allocate them in main memory. The second method is to provide means for
correctly handling pointers to registers. The RISC architecture specifies the
latter approach. There has been a detailed design (data-path and timing)
[Kate80] for the implementation of pointers to registers in the RISC I chip,
resulting  in  no  lost  cycles.  However,  neither  the  RISC  I  nor  the  RISC  II  chips 
have  implemented  this  scheme,  because  of  lack  of  designer  time.  The  RISC/E 
design  (§  1.2)  does  include  the  handling  of  pointers  to  registers.  The  RISC  II 
micro-computer  design  contains  off-chip  circuitry  to  recognize  the  use  of 
addresses  pointing  to  registers  and  to  generate  an  interrupt  whenever  that 
occurs  [Liou83]. 

The  proposed  method  for  handling  pointers  to  registers  in  a  multi-window 
environment will be described here in a general form. It was developed in
collaboration with D. Patterson in September 1980.

We  use  the  notion  of  "conceptual  window  stack",  a  conceptual  stack  in 
memory consisting of a virtual image of the window frames of all active
procedures. It contains one word of storage for each register of each window that
has been "called" and did not yet "return". It is similar to the conventional
execution stack of procedure activation records, except that RISC uses frames
(records) of fixed size. Figure 3.2.3 illustrates this conceptual window stack.
For clarity, window frames are shown without their overlap in that figure. The
overlap registers are shown as belonging only to the window in which they
constitute input arguments, according to the discussion in § 3.2.2. Thus, the "frame
size", S_F, shown in that figure is equal to the number of incoming-argument and
local  registers  in  one  CPU  register  window,  which  is  also  the  amount  by  which 
the  window  pointer  is  moved  on  each  call  or  return. 

All  but  the  top  of  the  conceptual  window  stack  is  actually  present  in 
memory  in  the  form  of  older  windows  saved  there,  as  discussed  in  §  3.2.2.  The 
top  of  that  stack  is  in  the  registers  on  the  CPU.  Figure  3.2.3  illustrates  this 
arrangement. The address of a register, at a given time, is the address of the
corresponding word in the conceptual window stack. The base address A_f of the
frame with nesting depth nd is:

A_f = A_BASE + nd * S_F (1)

where S_F is the size of the window-frame as defined above. The Current-Window-
Pointer (CWP) and Saved-Window-Pointer (SWP), which were introduced in the
last subsection, are generalized here to be the memory addresses pointing to
the same frames as before, but now in the virtual image in the conceptual stack.
They  thus  define  the  boundaries  of  the  portion  of  that  stack  which  is  currently 
kept  in  the  register  file. 

The register file is of size S_RF windows, numbered wn = 0, 1, ..., S_RF-1. By
convention, the procedure with nd = 0 is given the window wn = 0. Also by
convention, each procedure call decrements the current wn, modulo S_RF. Since each
procedure call also decrements the current nd, it follows that the window-
number wn of a frame with nesting-depth nd, which is currently in the register
file, is:

wn = nd mod S_RF (2)

The register-number rn of a register in a CPU window, and the offset, offs, of the
corresponding word within the conceptual frame, measured from the base of the
frame, are related by an arbitrary but fixed one-to-one mapping:

rn = M(offs) (3)

That  mapping  is  defined  by  the  way  that  the  overflow-handler  routine  saves 
registers  in  memory  frames,  and  it  may  well  be  a  simple  linear  relation.  The 
compiler  must  know  that  mapping  in  order  to  generate  the  address  of  a  local 
scalar. 

The  memory  address  of  an  argument  or  local  scalar  variable  which  has  been 
allocated into register rn can thus be computed by:

CWP + M^-1(rn) (4)

where  CWP  is  known  at  run-time  when  the  procedure  is  entered. 

Now assume that a memory reference is made using a pointer p as effective
address. Special action has to be taken if and only if p is pointing to a register:

CWP ≤ p < SWP (5)


If that last condition holds true, then the window-number wn(p) and the
register-number rn(p) of the register where p is pointing to must be
determined, so that the memory reference can be correctly turned into a register
reference. Let A_f(p) be the base address of the conceptual frame where p is
pointing to. Since 0 ≤ p - A_f(p) < S_F, and since A_f(p) - A_BASE is a multiple of S_F
(by equation (1)), it follows that:

floor( ((p - A_f(p)) + (A_f(p) - A_BASE)) / S_F ) = (A_f(p) - A_BASE) / S_F (6)


Combining this with equation (1), we get:

nd(p) = floor( (p - A_BASE) / S_F ) (7)


and combining with equation (2) we find the window-number where p is pointing
to:

wn(p) = floor( (p - A_BASE) / S_F ) mod S_RF (8)


Finally, to get the register-number rn(p), we will use equation (3) and the
property a mod b = a - floor(a/b)*b:

rn(p) = M( offs(p) ) = M( p - A_f(p) )
      = M( p - A_BASE - nd(p)*S_F )
      = M( p - A_BASE - floor( (p - A_BASE) / S_F )*S_F )
      = M( (p - A_BASE) mod S_F ) (9)

All this complicated arithmetic reduces to trivial bit-field extractions and
concatenations, when the pertinent constants are powers of 2. Such is the case
in RISC II:

RISC II:  A_BASE = 2^32
          S_F = 64 (bytes)
          S_RF = 8
          M(offs) = 16 + offs/4 (byte addresses).

Under these circumstances, the important equations become:

(Address of local or input-arg. in Rn) = CWP + (n-16)*4 (4')

wn(p) = p<8:6> (8')

rn(p) = 1 || p<5:2> (9')

where F<m:n> is the bit-field extraction operator, and F1 || F2 is the bit-field
concatenation operator.
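
As a check on the reduction above, the following sketch evaluates the general formulas and the bit-field forms side by side for the RISC II constants. The function names are hypothetical; the constants are those listed above, and addresses are assumed to be 32-bit byte addresses.

```python
A_BASE = 2 ** 32      # base address of the conceptual window stack
S_F = 64              # frame size in bytes (16 registers of 4 bytes)
S_RF = 8              # number of windows in the register file


def reg_from_offset(offs):
    """M(offs): register number for a byte offset within a frame."""
    return 16 + offs // 4


def locate_general(p):
    """Window and register number with explicit division and modulo."""
    nd = (p - A_BASE) // S_F              # floor division; negative below A_BASE
    wn = nd % S_RF
    rn = reg_from_offset((p - A_BASE) % S_F)
    return wn, rn


def locate_bitfields(p):
    """The same result obtained by bit-field extraction and concatenation."""
    wn = (p >> 6) & 0b111                 # p<8:6>
    rn = 0b10000 | ((p >> 2) & 0b1111)    # 1 concatenated with p<5:2>
    return wn, rn


def address_of_register(cwp, n):
    """Memory address of local or input-argument register Rn (n in 16..31)."""
    return cwp + (n - 16) * 4
```

Python's floor division and modulo already follow the floor convention for negative operands, so nd comes out negative for addresses below A_BASE while wn stays in the range 0 to S_RF-1.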

Concerning the detection of pointers p addressing registers, equation (5)
says that two full-address comparisons are needed. The comparison of p with
SWP is required in order to decide whether p's frame is currently in a register
file window or has been saved in memory. The comparison of p with CWP is
required in order to decide whether p is pointing into the conceptual window
stack or simply to something else in memory. However, this latter condition can
also be checked by comparing p against A_LIMIT:

A_LIMIT ≤ p < SWP (5')

where A_LIMIT is the boundary address of the portion of virtual memory allocated
for the window stack (see bottom of fig. 3.2.3). If A_LIMIT is a "convenient"
hardwired constant, then the comparison of p against it may be implemented
with very simple hardware. In RISC II, A_LIMIT = 2^32 - 2^24, and that comparison
reduces to p<31:24> = 11111111, which can be checked with a single NOR gate.
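
The simplified detection test can be sketched as follows. The value of A_LIMIT used here is an assumption inferred from the bit test on p<31:24> (it is the smallest 32-bit address whose top eight bits are all ones); the function names are hypothetical.

```python
A_LIMIT = 2 ** 32 - 2 ** 24   # assumed: implied by the test p<31:24> = 11111111


def in_window_stack(p):
    """Test A_LIMIT <= p for a 32-bit address: the top eight bits are all ones."""
    return ((p >> 24) & 0xFF) == 0xFF


def points_to_register(p, swp):
    """p lies in the window stack and below SWP, so its frame is in the register file."""
    return in_window_stack(p) and p < swp
```

Only the comparison against SWP remains a full-address comparison; the other test needs just the eight most significant address bits.
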

If the window size or register-file size are not powers of 2, then the proposed
method is to modify the definitions of S_F and of nd in the following way: S_F
should be defined as the smallest power of 2 that is large enough for a frame to
fit in. The nd counter, which counts down/up on every call/return, should be
made into a conventional counter for its most-significant bits, coupled with a
modulo-S_RF counter in its least-significant bits. This will waste some words in
main memory, where the window stack is kept, but this solution is preferable to
implementing hardware to carry out integer divisions by arbitrary constants.


3.3  The  RISC  I  &  II  Pipelines. 


The pipeline organizations of RISC I and II are presented here. They form
the basis of the micro-architecture of these two implementations, and they have
even influenced the original definition of the RISC I & II architecture. RISC I has
a two-stage pipeline, while RISC II has three stages. The issue of pipeline
suspension during data memory accesses is discussed. Other possible pipeline
schemes are reviewed. The default presence of an addition in all register-to-
register moves and in all addressing modes of RISC is explained.

3.3.1 Two and Three Stage Pipelines.

Most  RISC  instructions  (sect.  3.1)  can  be  executed  within  the  same  amount 
of  time,  adhering  to  the  following  execution  pattern: 

• read Rs1 and Rs2 (or get PC or imm),

• perform an add/sub or logic or shift operation on S1 and S2, and

• write the result into Rd, or use it as an effective-address for a memory
access.

Load and store, the only instructions containing a data memory access, require
an additional cycle for completing their execution. They will be discussed in §
3.3.2. The simple execution pattern of the RISC instructions leads to simple
pipeline schemes. Figure 3.3.1 shows the RISC I and II pipelines, along with the
resulting utilization of the data-path.

Figure 3.3.1: The RISC I and II Pipelines.

RISC I (fig. 3.3.1(a)) has a simple two-stage pipeline, overlapping instruction
fetch and execution, and including the delayed-branch scheme (section 3.1.3).
It is assumed that an instruction-fetch memory cycle takes roughly the same
amount of time as a CPU read-operate-write cycle. Figure 3.3.1(b) shows the
requirements placed on its data-path. For good performance, a two-port
register read is specified, in order to simultaneously get Rs1 and Rs2. Next, the
operation is performed on the two sources, while the register file remains idle.
After that has completed, the result can be written into Rd, while the
operational unit, now, remains idle. For an efficient NMOS implementation the two
read-busses of the register-file have to be precharged before a read operation is
made. In order to reduce the cycle time, RISC I precharges those busses in
parallel with writing Rd. Thus, its register file must have 3 busses.

In RISC II a third pipeline stage was introduced (fig. 3.3.1(c)), and the
writing of Rd was delayed until that stage. Internal forwarding is used to resolve
register interdependencies among subsequent instructions in the pipeline: Two
equality comparators detect the conditions Rs1,I2 = Rd,I1 or Rs2,I2 = Rd,I1. When
these occur, the result of I1's operation is automatically forwarded from the
temporary latch where it is kept, for use by I2, in lieu of the stale contents of
Rd,I1.
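
The forwarding rule can be sketched in a few lines. This is an illustration only, not the RISC II circuit: the function read_operand and its parameters are hypothetical, the register file is modeled as a dictionary, and one call stands for one of the two equality comparators.

```python
def read_operand(reg, regfile, pending_dest, pending_value):
    """Return the value an instruction should see for source register `reg`.

    pending_dest and pending_value model the temporary latch that holds the
    previous instruction's result before it has been written back.
    """
    if pending_dest is not None and reg == pending_dest:
        return pending_value      # forwarded from the latch
    return regfile[reg]           # register file contents are current
```

For example, if I1 computes r3 = r1 + r2 and I2 reads r3 and r1, then I2 receives the latched sum for r3 and the ordinary register-file contents for r1.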

The requirements that this pipeline scheme places on the data-path are
radically different from the previous ones (figure 3.3.1(d)). Here, the register
file is kept busy all the time. It performs the write of Rd,I1 immediately after
the reads of Rs1,I2 and Rs2,I2. The register-write operation and the precharging of
the register-file busses are done in parallel with the ALU or shift operation,
instead of sequentially after it as the two-stage pipeline requires. This results in
a performance gain, part of which can be spent to perform the precharging
after the register-writing, thus allowing the use of a two-bus register file. The
more compact two-bus register-cell is the main reason why the RISC II chip
could pack 75% more registers than RISC I into a 25% smaller area. If an ALU or
shift operation takes as much time as a register-write plus precharging the
register file busses, then precharging the register file busses after the write
operation will result in no performance loss relative to precharging and writing
in parallel, with a three-bus scheme.

3.3.2  Pipeline  Suspension  during  Data  Memory  Accesses. 

The  RISC  I  and  II  CPU  chips  have  a  single  memory  port  and  assume  a  non- 
pipelined  memory.  This  means  that  only  one  memory  access  may  be  in  progress 
at  any  time.  As  a  result,  when  the  data  memory  access  of  a  load  or  store 
instruction  is  being  carried  out,  the  rest  of  the  pipeline  is  temporarily 
suspended,  because  an  instruction-fetch  access  cannot  be  processed  at  the 
same  time.  This  situation  is  illustrated  in  figure  3.3.2  (a). 

The  limitation  of  a  single  memory  port  is  quite  common  in  microcomputer 
systems.  It  is  the  result  of  pin  constraints,  of  the  presence  of  a  single,  non- 
pipelined  memory  bank,  and  of  the  absence  of  on-chip  cache(s).  However,  as 
§ 6.3 suggests, it is desirable to integrate an instruction cache on a RISC-style
CPU  chip,  once  the  technology  makes  that  feasible.  An  on-chip  cache  appears  as 
an  independent  memory  port  to  the  CPU,  whenever  a  miss  does  not  occur.  The 
CPU  then  effectively  sees  two  separate  memory  ports,  one  for  instructions  and 
one  for  data.  Thus,  it  is  appropriate  to  study  pipelines  which  are  not  suspended 
on  data  memory  references,  as  figure  3.3.2  (b)  shows. 

Figure 3.3.2: Pipeline Suspension during Data Memory Accesses.

When the constraint of single memory access per cycle is removed, the data
access cycle of load and store instructions can occur in parallel with the
compute cycle of the next instruction. For store instructions, this poses no problem
of data dependency. For load instructions, however, it does. The compute cycle
of the instruction immediately following a load must not depend on the value
being loaded. This condition needs to be checked by the compiler, which may
insert a no-op if no useful work can be done in the slot after the load instruction.
Alternatively, such a dependency may be detected by hardware, which then
suspends the pipeline while waiting for the data to arrive from memory.
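
The compiler-side check can be sketched as a simple pass over an instruction list. The representation (Instr, NOP, fill_load_slots) is hypothetical, and the pass is deliberately minimal: it only inserts a no-op when the instruction after a load reads the loaded register, whereas a real compiler would first try to move an independent instruction into the slot.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


NOP = Instr("nop", "", ())


def fill_load_slots(program: List[Instr]) -> List[Instr]:
    """Insert a no-op after each load whose successor depends on the loaded value."""
    out = []
    for i, ins in enumerate(program):
        out.append(ins)
        if ins.op != "load":
            continue
        nxt = program[i + 1] if i + 1 < len(program) else None
        if nxt is not None and ins.dest in nxt.srcs:
            out.append(NOP)       # the slot after the load cannot use its result
    return out
```

A load that is the last instruction of the list is left alone here, which assumes the caller handles dependencies across block boundaries.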

If  the  register-file  can  only  handle  a  single  register-write  per  cycle,  as  in  the 
case of RISC II, a dummy pipeline stage has to be inserted into all instructions
at  the  place  where  loads  perform  their  memory  access  (figure  3.3.2  (b)). 

An advanced pipeline scheme using dual memory write-ports certainly
speeds up load and store instructions. But what is its overall impact on
performance? How frequent are store instructions, and how frequent are load
instructions followed by a computation that doesn't depend on them? John
Cocke of the IBM Watson Research Center gave the following numbers regarding
the 801 during an informal discussion [Cock83]. About 16% of all executed
instructions (on the 801) are loads followed by an independent computation, and
about 9% of all executed instructions are loads followed by a dependent
computation. As far as data memory accesses are concerned, the 801 has a pipeline
similar to fig. 3.3.2 (b), but with dual-port register writes and no dummy stages.
Cocke gave no figure on the percentage of store instructions, but these usually
range around 10% (see e.g. sect. 2.2.1 [AlWo75]). These numbers (16% independent
loads plus about 10% stores) show that about one quarter of all execution cycles
can be saved in the 801 by not suspending the pipeline on data memory accesses.

However, these figures concern a processor with no register windows. Such
processors need to access variables in memory or save/restore registers more
often than processors with register windows. RISC programs execute fewer
loads/stores. In three measured program runs, 17%, 13%, and 15%, respectively,
of all executed instructions were loads [PaSe81, fig. 15]. The corresponding
percentages for store instructions were 1%, 1%, and 9%. In RISC, the instructions
following the loads can be expected to depend on them more frequently than in
other architectures, because restoring multiple registers from memory, upon
procedure returns, is much less frequent. Thus, the percentage of RISC
execution cycles that can be saved by allowing simultaneous instruction-fetches and
data-memory-accesses can be estimated to be in the range of 10%.

During  the  RISC  design  process,  another  possibility  was  considered.  If  no 
other  instruction  can  be  fetched  for  execution  during  the  data-memory-access 
cycle,  one  could  try  to  pack  more  information  into  the  load/store  instruction 
itself,  so  that  the  CPU  can  do  something  useful  during  the  above  cycle.  As  an 
example, a third instruction format could be introduced, where the
short-SOURCE2 of figure 3.1.1 would be split into a 9-bit source-2, S2, and a 5-bit Rs3
specifier. Load and store instructions having this format would perform the
following operations during their two execute-cycles:

• eff_addr ← Rs1 + S2

• Rd ↔ M[eff_addr]; compare-&-set-CC's: Rs1 - Rs3

These instructions could be used for implementing combinations of HLL
statements such as:

c = *p; if (p >= limit) ....

Such  combinations  are  quite  rare,  however.  For  example,  in  the  critical  loops  of 
section  2.4,  there  are  some  program  segments  in  procedure  gline()  of  sed 
(2.4.2)  that  come  close  to  the  above;  however,  still  none  of  them  is  suitable  for 
this optimization. In any case, the inclusion of instructions like the above would
lengthen  the  basic  processor  cycle-time,  because  additional  register-number 
latches  and  multiplexors  would  be  required  in  the  critical  path  of  register- 
number  decoding  (section  4.2).  Thus,  such  instructions  were  not  included  in  the 
RISC  architecture. 

3.3.3  Other  Pipeline  Schemes,  and  the  Issue  of  Default  Addition. 

More  pipelining  than  what  RISC  II  has  is  possible  in  RISC-like  register-to- 
register  architectures.  Figure  3.3.3  compares  the  3-stage  RISC  II  pipeline  (a), 
with the 4-stage 801 pipeline (b) [Cock83]. The 4-stage pipeline pushes the
data-path utilization as far as data-dependencies permit. The result of an
arithmetic, logic, or shift operation may be used as a source for the operation of the
next instruction as soon as it becomes available (arrow (1) in fig. 3.3.3(b)). In
order to avoid doubly-delayed jumps (arrow (2) in the same figure), the 801
performs the addition of the PC with the immediate offset, for PC-relative branches,
in parallel with the source-register reads (arrow (3)). Of course, the 4-stage
pipeline places heavier requirements on the register file and on the
instruction-fetch mechanism.


Figure  3.3.3:  Various  Pipelines. 


In the Berkeley RISC architecture there is no register-to-register move
instruction. It is synthesized by executing Rd ← Rs + 0, thus performing a
dummy ALU or shift operation which by default exists in every RISC instruction.
This architectural decision can be explained on the basis of the pipeline
organization. In RISC I (fig. 3.3.1(b)), move instructions with no ALU/shift operation
could execute about 30% faster. However, the corresponding shorter cycles
would make timing irregular and introduce significant implementation
difficulties. Furthermore, they would require a 30% faster instruction-fetch
mechanism, even though this increase in speed could not be exploited during
the rest of the cycles. Referring to figure 3.3.3, it can be seen that, in RISC II (a),
and in the 4-stage pipeline (b), removing the op part from one instruction
execution would yield no performance gain either, as long as the pipeline is limited
by  instruction-fetches  and  by  register-file  accesses.  In  part  (c)  of  the  figure,  a 
pipeline  is  shown  which  could  exploit  the  available  parallelism  in  the  case  of  a 
move  instruction.  Two  instructions  would  have  to  be  fetched  and  executed 
simultaneously.  The  MIPS  processor  [Henn83]  does  allow  two  instructions  to  be 
packed  in  one  word  and  be  fetched  from  memory  simultaneously.  However, 
each  major  execution  cycle  (pipeline-step)  contains  two  minor  cycles.  When  two 
register-to-register  instructions  are  packed  together,  they  are  in  fact  executed 
sequentially  —  each  during  one  minor  cycle  —  because  of  the  lack  of  sufficient 
hardware  resources  to  support  more  parallelism. 


Another related issue is the single addressing mode of RISC I & II, which
always performs an addition when computing the effective-address, regardless of
whether it is needed or not. References such as mere pointer indirections (*p)
are executed by Rd ↔ M[Rp + 0], and the addition is wasted. The reasons for
this architectural decision again have to do with the pipeline scheme and with
the single memory-port. Figure 3.3.4 (a) shows the execution of a load
instruction in the RISC II pipeline. If its address calculation requires no addition, then
the  data-memory-read  operation  could  be  performed  half  a  cycle  earlier,  as 
shown  in  part  (b).  This  would  allow  the  next  instruction  to  start  executing  one 
half  or  even  one  full  cycle  earlier.  However,  the  data  memory  access  would  have 
to overlap with the instruction fetch accesses, something which RISC I & II
cannot do, as discussed in § 3.3.2. Section 6.4.1 will show how the timing of fig.
3.3.4(b)  is  possible  including  the  addressing  addition  if  a  data-cache  is  used. 

Instead of trying to start the memory access and the next instruction
earlier than normal, some other useful work could be done in lieu of the
unnecessary address addition. Examples of what these modified load/store instructions
could do are:

• choice of: eff_addr = Rs1, or eff_addr = Rs1+S2, and

• optionally compare-&-set-CC's: Rs1±S2, or

• optionally: Rs1 ← Rs1+S2.

The former option of comparing and setting the CC's is similar to the variation
that was examined at the end of § 3.3.2. The latter option of modifying Rs1 is
similar to the auto-increment/decrement modes of the PDP-11 and VAX-11
architectures. Shustek [Shus78] found that about 15% of the statically used
addressing modes on the PDP-11 are auto-inc/dec modes, but he also found that
some of them were merely used to increment/decrement the register without
using the accessed memory word. The auto-inc/dec modes are well suited for
stack accesses, which the PDP-11 uses a lot because it has few registers (only
32% register-mode usage, in the above study). RISC, on the contrary, performs
many fewer memory references, thus the gain which auto-inc/dec mode could
bring is limited. Besides, a complication would arise with these modes. Writing
into Rs1 would have to occur during the data-memory-access, and if that access
leads to a page-fault interrupt, Rs1 would already have been modified. We
preferred not to implement any of those modified load/store instructions, in order
to stay with a clean and simple architecture.


3.4 Evaluation of the RISC I & II Instruction Set.

In this section we evaluate the Berkeley RISC architecture from various
points of view. First, its most controversial part, the reduced instruction set, is
considered. We discuss its appropriateness for a High-Level-Language (HLL)
computer, its impact on code size, and its effect on machine performance. The
overall machine is then evaluated, taking into consideration the large
multi-window register file, the reduced design time, and the elimination of design
errors.

3.4.1  Instruction  Set  and  High-Level-Languages. 

RISC  instructions  have  some  similarity  to  the  micro-instructions  on  typical 
micro-programmed  machines,  some  of  which  will  be  further  discussed  in 
chapter  4: 


•  All  instructions  have  the  same  width,  and  most  of  their  fields  have  fixed 
size  and  position  (3.1.4). 

•  All  instructions  execute  in  the  same  amount  of  time  (except  for  the 
"minor"  irregularity  of  pipeline  suspension  during  loads/stores)  (3.3). 

•  All instructions follow a similar and fixed pattern of execution in the
data-path  (4.1). 

•  Delayed  branches  are  used  (3.1.3). 

•  The  instruction  decoder  is  so  simple  that  it  occupies  only  0.5%  of  the 
chip  area  (4.4). 

•  The  control  signals  which  sequence  the  execution  of  instructions  in  the 
data-path  are  generated  by  some  simple  gates  that  occupy  just  1%  of  the 
chip  area  (4.4). 


Based  on  these  similarities,  some  people  argue  that  the  RISC  instruction  set  is 
"of  too  low  a  level  for  a  High-Level-Language  computer". 

However,  several  frequent  HLL  statements  are  compiled  into  only  a  single 
or  a  few  RISC  machine  instructions.  Here  are  some  examples  from  the  critical 
loop  of  fgrep  (§  2.4.1): 


HLL statement:              RISC machine instructions:

if (--ccount <= 0)          sub-&-set-CC's:  Rccount ← Rccount − 1
                            jump-if-less-or-equal

c->inp == *p                load:  Rt1 ← M[Rc + OFFSinp]
(from ccomp() macro)        load:  Rt2 ← M[Rp + 0]
                            sub-&-set-CC's:  R0 ← Rt1 − Rt2

c = c->fail                 load:  Rc ← M[Rc + OFFSfail]


Thus,  RISC  machine  instructions  are  not  far  away  from  some  very  frequent  HLL 
statements.  We  could  even  argue  here  that  the  variants  of  the  load/store 
instructions which we examined at the end of sections 3.3.2 and 3.3.3 actually
correspond  to  two  HLL  statements  each. 

The topic of High-Level-Language computers (HLLC) has attracted much
interest among computer architects and programmers during the last two
decades. Some view a HLLC as a machine that should reduce the "semantic
gap" between HLL's and machine code. However, Ditzel and Patterson argue
that there is no obvious justification as to why this should be a desirable goal
[DiPa80]. Instead, they define a HLLC as one where all programming, debugging,
and error reporting take place in a HLL, so that the user need not be aware of
the existence of the machine language. Thus, whether an instruction set is close
to micro-code or close to HLL statements is irrelevant to the issue of HLLC.
What is important is whether a compiler and a symbolic debugger can be built
for a particular architecture, and how fast compiled HLL programs run.

Writing compilers for RISC has proven quite easy, because the instruction
set provides simple and straightforward primitives for synthesizing HLL
functions. Johnson's Portable C Compiler (PCC) and a peephole optimizer have
been modified in less than 8 person-months, to produce code for RISC [Camp80].
Miros also produced another, more solid, C compiler for RISC, again modifying
the PCC [Miro82]. This RISC PCC had one third fewer code-table entries than the
comparable VAX-11 PCC.

Other measures can be used to show that RISC is no less a High-Level-Language
architecture than are other favorite processors. Campbell [Camp80]
gives the static number of machine instructions in 12 C programs, compiled and
optimized for the RISC, for the VAX-11, and for the PDP-11. Relative to the
VAX-11 code, the PDP-11 code has 40% more instructions, and the RISC code has
67% more instructions, on the average. This shows that, although RISC
instructions do contain "less information" than VAX or PDP instructions and
could thus be considered "lower level", the difference is not all that dramatic.

Figure 3.4.1: Normalized execution time of 5 EDN benchmarks (without
procedures), on 5 machines, in C and in Assembly.

Figure 3.4.1 contains some performance measurements of 5 programs with
no procedure calls, published in [PaPi82] and adjusted here for the measured
330nsec and 500nsec cycle times of two scaled versions of RISC II chips (§ 5.2).
Execution times of C and Assembly versions of 5 programs are given, normalized
relative to the C version on a 500nsec RISC II. We will come back to these
performance measurements in § 3.4.3. Of interest here is the ratio of the
assembly-code execution-time to the compiled-code execution-time on the various
machines. The averages of the corresponding ratios are as follows:

This ratio is a measure of the loss in performance due to programming in a HLL
rather than in assembly language. The lower this ratio, the more the programmer
is tempted to write assembly code. Using this measure, RISC is the best
HLL architecture among the ones examined above [PaPi82]. It can be seen that
compilers have difficulty making effective use of the complex instructions
that other processors provide.


3.4.2  Instruction  Set  and  Code  Compactness. 

An  instruction  set  and  an  instruction  encoding  that  achieve  compact  code 
are  desirable  for  two  main  reasons.  Firstly,  they  allow  the  computer  system  to 
have smaller memory devices for holding the same amount of compiled
programs. Memory devices, here, are disks, main memory, and (instruction) cache.
By  being  smaller,  these  devices  can  be  faster  and  cheaper.  Or,  alternatively, 
memory  devices  of  the  same  size  can  hold  more  compiled  code.  Secondly,  when 
the  machine  code  is  more  compact,  less  bandwidth  is  necessary  for  fetching 
instructions  into  the  CPU  at  the  desired  rate.  This  allows  busses  to  be  cheaper. 
Alternatively,  the  same  bandwidth  will  allow  faster  fetching  --  and  thus  faster 
execution  —  of  the  compiled  program. 


However, in several actual situations, the above effects may be weak, while
achieving code compactness may be expensive in other ways. There are two
main methods for reducing the average code size. Firstly, an instruction format
closer to Huffman encoding may be utilized. This means having a variable
number of fields in the instructions, and possibly having the fields encoded with
variable sizes. The choices are made according to the relative frequency of
usage of instruction and field types. Secondly, frequent sequences of related
primitive operations can be made into single instructions. This allows the
elimination of fields specifying intermediate results, or of multiple fields
specifying common operands. It reduces the number of instructions that need to
be fetched.


Instruction  encoding  and  combining  must  be  done  carefully,  to  avoid  some 
possible  negative  effects  on  CPU  performance  and  cost.  The  circuitry  that 
decodes  instructions  and  controls  their  execution  can  become  large  and  costly 
if  the  instruction  encoding  is  too  complicated.  Performance  can  be  severely 
impaired if decoding the instruction must involve serial rather than parallel
operations. The extraction and interpretation of critical instruction fields
should not depend on previous complex decodings of other fields. Also,
instructions that require a long execution time, with many intermediate results, may
necessitate  the  inclusion  of  too  many  latches  into  the  data-path.  This  may  slow 
down  execution,  due  to  increased  capacitive  loading,  and  may  render  interrupt 
handling  awkward  and  slow. 


Trying  to  improve  performance  by  compacting  the  machine  code,  in  order 
to  alleviate  the  instruction-fetch  bottleneck,  has  its  limitations.  Firstly, 
instruction-fetch  is  usually  overlapped  with  instruction  execution,  and  thus, 
reducing the instruction-fetch time beyond the available overlap brings no
performance gain. Secondly, unless a sophisticated buffering mechanism is used,
fetching an instruction takes a whole number of fetch cycles. In other words,
instruction pieces that are narrower than the bus width still require a full
cycle to be fetched. Furthermore, instructions of integer word-width may
require an additional fetch cycle if they are not aligned on word boundaries.
The cost of an instruction-buffering mechanism that could remedy such
problems  is  rarely  lower  than  the  cost  of  simply  increasing  the  width  of  the  bus 
and  of  the  memory  devices,  and  thus  achieving  the  same  fetch  rate  with  wider 
instructions  that  are  more  conveniently  aligned. 
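
The fetch-time argument can be made concrete with a small sketch. The function
below is an illustrative model and is not taken from this report: it assumes no
instruction buffering and one bus-width word transferred per fetch cycle, and
the parameter names instr_bits, bus_bits, and offset_bits are hypothetical.

```python
def fetch_cycles(instr_bits: int, bus_bits: int, offset_bits: int = 0) -> int:
    """Fetch cycles needed for one instruction when nothing is buffered.

    offset_bits is the position of the instruction's first bit within a
    bus-width word (0 means the instruction is aligned).
    """
    if instr_bits <= 0 or bus_bits <= 0:
        raise ValueError("sizes must be positive")
    first = offset_bits % bus_bits
    # Number of bus-width words touched by bits [first, first + instr_bits).
    return (first + instr_bits + bus_bits - 1) // bus_bits


print(fetch_cycles(16, 32))       # 1: a half-word piece still costs a full cycle
print(fetch_cycles(32, 32))       # 1: aligned one-word instruction
print(fetch_cycles(32, 32, 16))   # 2: same width, but straddling two words
```

Under these assumptions a narrower encoding saves fetch time only when it
reduces the number of words touched, which is why alignment matters as much as
instruction width.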

All  these  considerations  convinced  us  to  stay  with  the  simple  instruction 
set,  and  with  the  two  fixed  and  regular  instruction  formats  of  section  3.1,  even 
though  they  are  somewhat  wasteful  in  code  size.  Instructions  are  word-aligned, 
and their width is always one word. Thus, exactly one cycle - the minimum
possible - is required to fetch any instruction. The execute-cycle of instructions
was  defined  to  perform  as  much  work  as  practically  possible  during  the  one 
cycle  that  it  takes  to  fetch  the  next  instruction.  As  a  result  of  the  simple 
instruction  format,  the  decoding  and  field-extraction  circuitry  is  trivial,  at  most 
1  or  2  %  of  the  chip  area  (§  4.3).  The  relevant  trade-offs  were  studied  carefully, 
such  as  long  constants  (§  3.1.4)  and  modified  loads/stores  (§  3.3.2  and  §  3.3.3). 
However,  in  all  cases,  the  simpler  solution  looked  better.  After  all,  memory 
costs  are  decreasing,  and  "wasting”  memory  is  quite  common.  For  example,  a 
full  word  (32  bits)  is  usually  allocated  for  every  integer,  regardless  of  its  actual 
range. 

Even though RISC has such a simple instruction set and instruction format,
its average code size is only modestly larger than that of other processors:

                    Code Size Relative to RISC

Machine:

RISC I, II          1.0             1.0
VAX-11/780          0.8 ±0.3        0.67 ±0.05
M68000              0.9 ±0.2
Z8002               1.2 ±0.6
PDP-11/70           0.9 ±0.4        0.71 ±0.12
BBN C/70            0.7 ±0.2

We see that RISC code is usually not more than 50% larger than the rather
compact VAX-11 code.
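
The "not more than 50% larger" statement follows from inverting the relative
sizes in the table. The short sketch below shows that arithmetic; the function
name risc_excess_percent is a hypothetical helper, and only the mean values
are used (the ± spreads are ignored).

```python
def risc_excess_percent(relative_size: float) -> float:
    """How much larger RISC code is, in percent, than code for a machine
    whose code size relative to RISC (RISC = 1.0) is relative_size."""
    return (1.0 / relative_size - 1.0) * 100.0


print(round(risc_excess_percent(0.67), 1))   # 49.3 (VAX-11, second column)
print(round(risc_excess_percent(0.8), 1))    # 25.0 (VAX-11/780, first column)
```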

Garrison and VanDyke have studied how much RISC code size could be
reduced by encoding the same instruction set with variable-length fields and
instructions [GaVD81]. Their results, which are also reported in [Patt83],
indicated that the following savings, relative to the present RISC format, are
possible:

Huffman encoding (4 to 67 bits/instr.)     43 % savings
8-, 16-, 24-, and 32-bit instructions      35 % savings
16- and 32-bit instructions                30 % savings


The  last  encoding  is  done  by  introducing  half-word  encodings  for  7  special  cases 
of  existing  RISC  instructions.  This  simple  encoding  certainly  brings  RISC  code 
size into the same range as code for other popular processors. Patterson et al.
investigated the use of this encoding in connection with the RISC II
Instruction-Cache chip [Patt83].
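
To see why the simplest of these encodings already brings RISC code "into the
same range", the quoted savings can be combined with the VAX-11 figure of 0.67
from the code-size table of this section. The sketch below does only that
arithmetic; it mixes numbers from different studies, so the result is
indicative at best, and the names used are hypothetical.

```python
VAX_RELATIVE_SIZE = 0.67   # VAX-11 code size relative to present RISC code

SAVINGS = {
    "Huffman encoding": 0.43,
    "four instruction lengths": 0.35,
    "two instruction lengths": 0.30,
}


def size_vs_vax(saving: float, vax_relative: float = VAX_RELATIVE_SIZE) -> float:
    """Size of the re-encoded RISC code divided by the VAX-11 code size."""
    return (1.0 - saving) / vax_relative


for name, saving in SAVINGS.items():
    print(f"{name}: {size_vs_vax(saving):.2f}")
# Huffman encoding: 0.85
# four instruction lengths: 0.97
# two instruction lengths: 1.04
```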


3.4.3  Instruction  Set  and  Machine  Performance. 


Von  Neumann  computers  get  high  performance  either  from  fast  circuit 
technology  or  by  exploiting  fine-grain  parallelism.  The  latter  can  be  achieved  in 
several  ways.  One  is  the  "special-case"  method.  Some  frequent  combinations  of 
primitive  operations  are  detected  by  the  architect  and  are  made  into  single 
instructions.  Then,  the  micro-architect  tries  to  implement  these  instructions  in 
such a way as to exploit the parallelism available among the primitive
operations. Another way is the "general-case" method. A data-path with the desired
capabilities  and  cost  is  conceived  first.  Next,  the  architect  defines  simple 
instructions  that  describe  the  primitive  operations  available  on  the  data-path. 
Then  the  micro-architect  undertakes  to  pipeline  these  primitive  instructions  in 
such  a  way  that  they  constantly  keep  all  data-path  resources  busy. 

The "special-case" method has the advantage of requiring fewer instruction
fetches for the same amount of work; it also has the questionable advantage of
allowing better exploitation of parallelism since the particular environment of
execution is better known. It has the disadvantages of requiring complex
control and of only dealing with special cases. The opposite situation holds true
for the "general-case" method. It is more flexible to exploit parallelism,
wherever it is available, or to expose the machine capabilities and to allow the
compiler/optimizer to make full use of them. Controlling the instruction
execution is also simpler. Providing reasonable amounts of pipelining is not very
hard, even though the previous and subsequent primitive operations are not known
(see chapter 4). The "general-case" has the disadvantage of requiring more
instruction fetches.

Architectures  with  complex  instruction  sets  intend  to  get  high  performance 
using  the  "special-case"  method.  Reduced  instruction  set  computers  follow  the 
"general-case"  approach. 




The Berkeley RISC experiment has shown that the differences in code size
between the two methods need not be large (§ 3.4.2). On the other hand, it has
shown that the differences in size and complexity of the control circuitry are
large. While the control section covers 50 to 60 % of the chip area in the M68000
or in the Z8000, it only covers 6 to 10 % of the RISC I or II chip area (see § 4.4
and [Fitz81]). We believe that it is better to spend hardware resources in
implementing an instruction-cache, than to spend them in implementing complicated
control circuitry with a big micro-program ROM. The reason is that an
instruction-cache holds the instructions that are dynamically most frequently
used, while micro-storage holds the statically most frequent primitives, or -
even worse - some rarely used complex constructs. In RISC I & II, the scarce
chip transistors were spent to implement a multi-window register file, since that
one has even higher priority than an instruction cache (see chapter 6).


We mentioned above that the "special-case" method may have the
advantage of allowing better exploitation of parallelism because of the built-in
knowledge of the particular execution environment. However, the opposite may
also be true. Micro-programmers sometimes find it hard to correctly optimize
all the instructions in a complex architecture. For example, the VAX-11/780 has
an index instruction used for calculating the address of an array element, and
simultaneously checking whether the index fits within the array bounds. The
same task can be performed with multiple simpler VAX-11/780 instructions in 45
% less time [PaDi80]. A similar case for the IBM 370 is reported in [PeSh77]. A
sequence of load instructions is faster than a load-multiple instruction when
fewer than 4 registers are loaded.

The Berkeley RISC follows the "general-case" method of pipelining simple
instructions. Section 3.3 showed how the memory port is always kept busy and
how the register-file and the ALU of RISC II are kept busy all the time except for
the cases of dummy additions, when nothing else could practically be done.
More hardware resources make possible the exploitation of more parallelism in
RISC-style architectures. Such examples were given in 3.3.2 and 3.3.3 for the
case of separate instruction and data memory-ports. One can also consider the
possibility of simultaneously dispatching multiple simple instructions when
multiple functional units exist. Figure 3.3.3(c) offered one such example. A
proposal for parallel dispatching and execution of unconditional-branch and of
other CPU instructions will be presented in § 6.3.6.

Comparative measurements of RISC II speed, relative to that of other
microprocessors and mini-computers, have shown RISC's superior performance
[PaPi82] (also in [PaSe82]). For some of the processors, including RISC, these
were collected using a simulator. The average performance ratio from those
studies is given below, after being adjusted for the cycle times of the actual RISC
II chips (§ 5.2):


                                                 Execution Time /
                   Basic        Reg-to-reg       8MHz RISC II Exec. Time,
Machine:           Clock        add              averaged over 11 programs

RISC II T=330ns    (12MHz)      330ns
RISC II T=500ns    (8MHz)       500ns
VAX-11/780         5MHz         400ns            1.7 ±0.9
PDP-11/70          7.5MHz       500ns            2.1 ±1.2
M68000             10MHz        400ns            2.0 ±1.4
BBN C/70           8.7MHz       ?                3.2 ±2.2
Z8002              8MHz         700ns            3.3 ±1.3


Five of the above benchmark programs do not have procedure calls; they
consist of one single function. The execution-time ratio for these programs was
given  in  figure  3.4.1.  They  show  that  the  performance  advantage  of  RISC  is  still 
present,  even  when  the  multiple  windows  of  the  register  file  are  not  used. 


More extensive performance measurements were carried out by Miros
[Miro82]. He ran the VAX C compiler on both the RISC simulator and on the
VAX-11/780. The compilation of three programs took 26 seconds on a 330ns
RISC II (simulated), 38 seconds on a 500ns RISC II (simulated), and 50 seconds on
the VAX-11/780. It is worth noting that a register-to-register integer addition
takes 330ns or 500ns on RISC II, while the VAX-11/780 data-path can perform
such an operation in one 200ns micro-instruction, even though the execution of
a register-to-register add instruction takes 400ns on the VAX-11/780.

3.4.4  Overall  Evaluation  of  RISC  I  &  II. 

An  overall  evaluation  of  the  Berkeley  RISC  architecture  must  include  the 
multi-window  register  scheme,  the  area  and  transistor  statistics  of  the  VLSI 
implementation,  and  the  human  effort  that  was  required  to  design,  layout,  and 
debug  the  chips. 

The evaluation of the multi-window register file was done by Halbert and
Kessler and was reviewed in section 2.2.2 (with some further discussion in
section 3.2). Here are two more measurements from [PaSe82] for illustration
purposes:


Data-Memory-Traffic due to Call's and Return's:

                                         PUZZLE      QUICKSORT

VAX-11/780   words                       440 K       700 K
             % of all data-mem-ref.      28%         50%

RISC         words                       8 K         4 K
             % of all data-mem-ref.      0.8%        1%

These  numbers  only  concern  the  data  traffic  due  to  calls  and  returns.  Further 
savings  in  memory  accesses  are  achieved  by  the  default  allocation  of  locals  into 
registers.  Thus,  one  realizes  the  dramatic  savings  in  memory  traffic  that  the 
multi-window register file provides. Section 6.1 compares register files to cache
memories, while section 6.2 examines other possible organizations for them. A
study of the trade-off between the size of such a register file and its delays due
to capacitive loading can be found in Sherburne's thesis [Sher83].

In  the  RISC  I  and  II  NMOS  microprocessor  chips,  the  traditional  allocation  of 
scarce silicon resources has been radically altered owing to the reduced
instruction set. Control circuitry has been drastically reduced, and the silicon
area and transistors saved were used for the large register file. The foregoing
evaluation showed how the reduced instruction set leads to high utilization of the
data-path  hardware  by  the  executing  programs.  This  effect  is  amplified  by  the 
faster  basic  cycle  that  a  simple  data-path  achieves,  as  chapter  4  will  show.  The 
multi-window  register  file  further  enhances  performance.  The  overall  result  is 
what  we  consider  to  be  the  most  effective  utilization  of  the  scarce  VLSI 
resources  for  performing  general-purpose  computations.  And,  last  but  not  least, 
the  human  effort  required  to  design,  layout,  and  debug  the  processor  has  been 
reduced  by  almost  an  order  of  magnitude  relative  to  that  required  for  the  design 
of  other  microprocessors  (sect.  4.5).  This  reduces  costs  and  allows  faster 
exploitation  of  new  and  rapidly  changing  technologies. 

We believe that these points prove the viability of Reduced Instruction Set
Computer architectures for general-purpose VLSI processors.


THE RISC II DESIGN AND LAYOUT.


This  chapter  deals  with  the  micro-architecture  of  the  RISC  II  chip.  After  a 
detailed  description  of  the  data-path  (§  4.1),  it  presents  the  fundamental  timing 
dependencies  and  the  particular  timing  scheme  chosen  (§  4.2)  as  well  as  the 
organization  of  control  (§  4.3)  and  some  design  metrics  (§  4.4).  The  estimates 
and  the  rationale  which  guided  the  major  decisions  are  discussed  and  compared 
with  the  picture  that  emerged  after  the  circuit  was  designed  and  laid-out. 


4.1  The RISC II Data-Path, and its Use for Instruction Execution.

This  section  presents  the  RISC  II  data-path  and  the  basic  trade-offs  which 
were  considered  during  its  design.  The  general  form  of  the  data-path  is  a  direct 
consequence  of  the  instruction  set  (sect.  3.1)  and  of  the  chosen  pipeline  scheme 
presented  in  section  3.3. 


A  very  compact  register  cell  was  essential  for  the  implementation  of  a  large 
register  file.  Robert  Sherburne  designed  and  laid-out  such  a  compact  2-bus 
register  cell  by  modifying  the  classical  6-transistor  static  cell  [SKPS82], 
[Sher83].  The  modification  allows  dual-port  read-accesses  with  single-bus 
signed-sensing,  but  requires  both  busses  for  a  write  operation.  Dual  read-ports 
and  a  single  write-port  perfectly  match  the  basic  RISC  instruction  pattern  of 
reading two registers Rs1 and Rs2, and writing the result into a register Rd. The
cell requires a precharge - read - write cycle, which guided us in the choice of
the pipeline scheme (figure 3.3.1(c,d)). This cell is about 2.5 times smaller than
the  3-bus  RISC  I  register  cell,  and  this  feature  constituted  the  main  driving  force 
for  the  development  of  RISC  II. 

An  arbitrary-amount  bidirectional  shifter  is  included  in  the  data-path,  as 
the  instruction  set  specifies.  This  was  designed  and  laid-out  by  the  present 
author.  It  consists  of  a  cross-bar  switch  made  out  of  pass-transistors  [SKPS82]. 
A compact and versatile lay-out was achieved by routing one data-bus, R, in the
horizontal direction, while the other one, L, is diagonal, thus providing
connection points both on the side and at the top of the shifter module; the control-bus
is  vertical.  The  elementary  shifter  cell  is  a  bi-directional  pass-gate  that  is  used 
in  one  direction  for  a  left-shift,  and  in  the  other  direction  for  a  right-shift.  The 
shifter  busses  need  to  be  precharged  before  they  are  used. 
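
The way one array of pass-gates serves both shift directions can be sketched
in software. The model below is only an illustration under stated assumptions:
bit lists are indexed from the least-significant bit, a shift amount k is
assumed to join line i of bus R with line i+k of bus L, and lines that are not
connected are returned as None so that the caller decides how they are filled.
The names crossbar, to_bits, and from_bits are hypothetical.

```python
def to_bits(value, width=32):
    """Integer to list of bits, index 0 = least-significant bit."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits, fill=0):
    """List of bits back to an integer; unconnected lines (None) read as fill."""
    return sum((fill if b is None else b) << i for i, b in enumerate(bits))


def crossbar(bits, amount, drive_r):
    """Pass the bits through the switch array for the given shift amount.

    drive_r=True : bus R is driven and bus L is read (shift toward the MS end).
    drive_r=False: bus L is driven and bus R is read (shift toward the LS end).
    """
    width = len(bits)
    out = [None] * width
    for i in range(width):
        j = i + amount                 # the gate joining R[i] with L[j]
        if j < width:
            if drive_r:
                out[j] = bits[i]
            else:
                out[i] = bits[j]
    return out


print(bin(from_bits(crossbar(to_bits(0b1011), 4, drive_r=True))))   # 0b10110000
print(hex(from_bits(crossbar(to_bits(0xF0), 4, drive_r=False))))    # 0xf
```

The same set of closed switches is used in both calls; only the driven side
changes, which is the property the bi-directional pass-gate cell provides.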

A 32-bit integer add/sub ALU, the Program-Counter circuitry, and a
pipeline latch complement the basic data-path.

4.1.1  The  RISC  II  Data-Path. 

Figure 4.1.1 presents the RISC II data-path. Its basic parts, namely latches,
functional units, and busses are the following:

•  Register File: 138-word by 32-bit register file, with its dual-port address
decoder and with latches RA, RB, RD holding the register addresses
(numbers) from the instruction. R0 is hardwired to contain zero.

•  PSW: 13-bit Processor Status Word. It includes the CWP and SWP (sect.
3.2), the condition codes (CC's), and interrupt and system control bits
(Appendix A).

•  DST: the Destination latch, serving as the temporary pipeline latch. The
result of each cycle's operation is kept in there, until that result is
written into the register file, or otherwise used, during the next cycle.

•  SRC: Input (Source) latch for the shifter. DST or BI are used as the
shifter's output latch.

•  Shifter: The 32-bit cross-bar shifter. The amount of shifting (0 through
31) is specified by the contents of the SHam latch and decoded by the
shift-amount decoder (S.DEC). A right-to-left shifting occurs when
information flows in the busR→busL direction, and a left-to-right shifting for
the opposite direction.

•  AI, BI: the two input latches of the ALU. The ALU has no output latch; it
uses DST or busOUT for that purpose (busOUT will dynamically hold
information).

•  ALU: a 32-bit integer arithmetic and logic unit. It may perform addition,
subtraction, bitwise AND, OR, XOR, or pass BI to the output.

•  BAR: the Byte-Address Register, which computes and holds the 2
least-significant bits of the sum of AI and BI. In those cases when the ALU is
computing an effective-address, BAR will contain the part of the address
which specifies the byte-within-the-word alignment.

•  NXTPC: the Next-Program-Counter register, which holds the address of
the instruction being fetched during the current cycle.

•  INC: an incrementer, which computes NXTPC+4 (byte addresses).

•  PC: the Program-Counter register, which holds the address of the
instruction being executed during the current cycle.

•  LSTPC: the Last-PC register, which holds the address of the instruction
last executed - or last attempted to be executed. When an interrupt
occurs, LSTPC will hold the address of the interrupted (aborted)
instruction during the first cycle after the interrupt.

•  IMM: the Immediate latch to hold the 19 LS-bits of the incoming
instruction, which contain its immediate constant (if it has one).

•  DIMM: the Data.In/Immediate combined latch, preceded by the
sign-extender/zero-filler. It holds data coming-in from memory, or
immediate operands being forwarded to the data-path.

•  OP: the 7-bit opcode of the instruction, and the SCC and use-immediate
bits of the instruction (bits <31:25>, <24>, and <13> respectively; see
fig. 3.1.1).

•  busA, busB: the register-file busses.

•  busD: the bus used for feeding AI, and for feeding DST from the
right-hand side of the data-path.

•  busR, busL: the shifter's busses, optionally connected by the
bi-directional cross-bar shifter. BusR is also used for feeding BI, while
busL is also used for introducing Data.In and immediate constants into
the data-path.

•  busOUT: the bus used for routing addresses and data to the pads, and
from there to memory.

•  busEXT: The off-chip bi-directional time-multiplexed address/data bus,
which connects the CPU to the memory. It is electrically identical with
the 32 address/data bonding-pads, and with the 32 wires running in the
chip, and feeding RA, RB, RD, IMM, DIMM, and OP.
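
The roles of NXTPC, PC, and LSTPC can be pictured with a minimal sketch. It is
not a description of the actual circuit: it assumes that the three registers
advance together once per cycle, that the increment is 4 bytes (one 32-bit
instruction), and that a taken branch supplies its target in place of the
incremented value. The class PCChain and its step method are hypothetical
names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PCChain:
    nxtpc: int   # address of the instruction being fetched
    pc: int      # address of the instruction being executed
    lstpc: int   # address of the instruction last executed

    def step(self, branch_target: Optional[int] = None) -> None:
        """Advance one cycle. branch_target, if given, is the target produced
        by the instruction that was executing during the cycle just ended."""
        self.lstpc = self.pc
        self.pc = self.nxtpc
        if branch_target is None:
            self.nxtpc = (self.nxtpc + 4) & 0xFFFFFFFF
        else:
            self.nxtpc = branch_target & 0xFFFFFFFF


chain = PCChain(nxtpc=0x104, pc=0x100, lstpc=0xFC)
chain.step()                        # sequential flow
chain.step(branch_target=0x200)     # instruction at 0x104 was a taken branch
print(hex(chain.pc), hex(chain.nxtpc))   # 0x108 0x200
```

In this model the instruction at 0x108, which was already being fetched when
the branch executed, still runs before the target at 0x200, which is the
delayed-branch behaviour referred to in section 3.1.3.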


The  next  subsection  explains  how  these  latches,  functional  units,  and  busses 
are  used  for  executing  the  instructions. 


4.1.2  Paths  Followed  for  Instruction  Execution. 

There are a few categories of activities that may be going on in the data-path
during each cycle:


•  The appropriate two sources S1 and S2 are routed to the ALU or to the
shifter.

being executed. The BI input of the ALU is loaded from busR. That bus is
driven from busB - through SRC - when S2 is a register or the PSW. In that
case, the transistors of the shifter are all turned-off, so that busR is
disconnected from busL. When S2 is an immediate constant, busR is fed from DIMM,
through busL and through the shifter. The 19-bit IMM latch is connected to
DIMM in such a way that it feeds the 19 MS-bits of DIMM, while the 13 LS-bits of
DIMM are loaded with zeros. When the instruction contains a 13-bit immediate,
the sign-extender converts that into 19 bits. When this MS-aligned 19-bit
immediate goes through the shifter, it either stays MS-aligned for load-high
instructions, or it is right-shifted by 13 and sign-extended (i.e. LS-aligned), for
all other instructions.
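
The handling of immediate operands described above can be followed bit by bit
in the sketch below. It models only the arithmetic, not the circuit timing;
the function names and the short and load_high flags are hypothetical, and the
32-bit results are returned as unsigned bit patterns.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def dimm_from_imm19(imm19: int) -> int:
    """Place a 19-bit field in the 19 MS-bits of a word; 13 LS-bits are zero."""
    return ((imm19 & 0x7FFFF) << 13) & MASK32


def operand_s2(field: int, short: bool, load_high: bool) -> int:
    """32-bit operand produced from an immediate field.

    short=True means the instruction carries a 13-bit immediate, which is
    first sign-extended to 19 bits.
    """
    imm19 = sign_extend(field, 13) & 0x7FFFF if short else field & 0x7FFFF
    dimm = dimm_from_imm19(imm19)
    if load_high:
        return dimm                              # stays MS-aligned
    # Right shift by 13 with sign extension: the value becomes LS-aligned.
    return (sign_extend(dimm, 32) >> 13) & MASK32


print(hex(operand_s2(0x1FFF, short=True, load_high=False)))    # 0xffffffff
print(hex(operand_s2(5, short=True, load_high=False)))         # 0x5
print(hex(operand_s2(0x40001, short=False, load_high=True)))   # 0x80002000
```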


Figure 4.1.3 illustrates how the appropriate sources are routed to the
Shifter. For shift instructions, the quantity to be shifted is Rs1, which is read via
busA and placed into SRC. SRC then drives busR for right-to-left shifting, or
busL for left-to-right shifting. The amount of shifting is S2, a register or an
immediate constant. Thus, SHam is loaded with the 5 LS-bits of IMM or of busB
which carries Rs2.

Alternatively, the quantity to be shifted may be data to or from memory,
requiring alignment. In that case, the amount of shifting (alignment) is
specified by the BAR. When data from memory must be aligned at the end of a
load instruction, DIMM serves as the shifter's input. Notice that alignment of
incoming data requires left-to-right shifting only. When data to memory must
be aligned during a store instruction, that data comes from Rd and is read
through busB and placed into SRC. As was discussed in section 3.1.2, RISC II
limits the addressing modes of store instructions to having an immediate S2, so
that busB of the register file can be used to read the data at the same time
when busA is used to read the index register Rs1, and when busL and busR are
used to bring the immediate to BI.

Figure 4.1.4 illustrates how the output of the ALU or Shifter, or the PCs are
routed to DST, and later written into their final destination. This final
destination is Rd for most instructions, and the PSW for the putpsw
instruction. Data to be written into those destinations may originate from:

•  the ALU, for arithmetic and logical instructions, or for the putpsw,
getpsw, and load-high instructions. The last two instructions use the
ALU in the mode where BI is passed intact to the output.

•  the Shifter, for shift or load instructions.

•  the PC, for call instructions, which save their address into Rd for use by
the return instruction.

•  the LSTPC, for the calli and getlpc instructions; these are used on
interrupts, in order to get the addresses of the interrupted instruction and
of the instruction that was being fetched when the interrupt occurred.


Figure 4.1.4: Paths leading through DST.

BusD is used to route the ALU output, PC, or LSTPC into DST. The output of
the shifter comes from busR or busL, depending on whether it was a right or
left shift, respectively. DST holds those results of execution, until the
appropriate time in the last pipeline stage when they are written into Rd or
into PSW. That occurs via busses A and B.

Figure 4.1.5 illustrates how addresses and data are routed to memory and
how the three PC registers work. BusOUT is the only bus used for sending
information out of the data-path, and it is driven by a multiplexor with dynamic
storage capability. Addresses for instruction-fetching come from the ALU
output in the cases of a successful transfer of control, and from the
NXTPC-incrementer in all other cases. The NXTPC is always loaded from busOUT,
with whatever address is sent to memory for instruction-fetching. The PC and
LSTPC registers follow the contents of NXTPC with a delay of 1 and 2 pipeline
stages, respectively. During data-memory-access cycles, the whole PC-related
circuitry freezes (see sect. 3.3.2 on pipeline suspension).
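The behaviour of the three PC registers can be modelled as a short shift
chain. The sketch below is a behavioural illustration only; the class and
function names are hypothetical, and the increment of 4 assumes
byte-addressed 32-bit instructions, which this section does not state.

```python
class PCChain:
    """Behavioural model of NXTPC, PC and LSTPC (names taken from the text)."""

    def __init__(self, start):
        self.nxtpc = start
        self.pc = start
        self.lstpc = start

    def step(self, fetch_address, data_access_cycle=False):
        """Advance by one pipeline stage, unless the pipeline is suspended."""
        if data_access_cycle:
            return  # the whole PC-related circuitry freezes
        self.lstpc = self.pc        # LSTPC trails NXTPC by two stages
        self.pc = self.nxtpc        # PC trails NXTPC by one stage
        self.nxtpc = fetch_address  # NXTPC takes the address sent on busOUT


def next_fetch_address(nxtpc, alu_output, transfer_taken, increment=4):
    """ALU output on a successful transfer of control, else NXTPC + increment."""
    return alu_output if transfer_taken else nxtpc + increment


chain = PCChain(0)
chain.step(next_fetch_address(chain.nxtpc, 0, False))    # sequential fetch
chain.step(next_fetch_address(chain.nxtpc, 0, False))    # sequential fetch
chain.step(next_fetch_address(chain.nxtpc, 100, True))   # taken jump to 100
chain.step(0, data_access_cycle=True)                    # frozen cycle
```

After these four steps the model holds NXTPC = 100, PC = 8 and LSTPC = 4;
the frozen cycle leaves all three registers unchanged.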

Addresses for data-memory-accesses always come out of the ALU. Data are
sent to memory during store instructions. After these data have been read from
Rd and aligned (fig. 4.1.3), they are temporarily kept in DST (fig. 4.1.4). Then,
DST places them on busA, which then drives busD, and from which they are put
onto busOUT (figure 4.1.4).

This completes the description of the paths used for the various CPU
activities. The complete execution of an instruction is, in general, a
combination of some transfer from figure 4.1.2 or 4.1.3, followed by some
operation, followed by some transfer from figure 4.1.4 and some from 4.1.5.


4.1.3  Trade-offs  Considered  during  the  Data-Path  Design. 


To a large extent the RISC II data-path is a direct consequence of the
register-cell used, of the pipeline scheme, and of the instruction set
requirements. Some of its important characteristics, however, could be
different. Here, we will mention some alternatives, and we will give the reasons
for our particular choice. These choices will be evaluated in the next two
sections. For a more extensive and detailed study of data-path design
trade-offs, and one that particularly addresses electrical design issues, refer
to Sherburne's thesis [Sher83].

One trade-off relates to the way immediate constants are brought into the
data-path. The shifter is also used for that purpose, in addition to its main
function of executing shift instructions and aligning the data on load/store
instructions. Shift instructions and data alignment match well with each other,
because at most one of these operations occurs in any one cycle, and because
both occur near the end of the cycle. However, the routing of the immediates
does not match so well with those operations, because it has to occur at the
beginning of the cycle, and because it may occur in the same or adjacent cycle
with one of the other two functions. In spite of that non-optimal match, a timing
solution was found and implemented in RISC II (§ 4.2.2). As a consequence of
routing immediates through the shifter, the latter was placed between the ALU
and the register file.

There are two possible alternatives to the above scheme. Immediates could
be brought into the data-path from the right-hand side, using an extra
horizontal bus. This would increase the number of busses crossing the PCs and
the ALU, which would cause severe problems for the layout of those densely
populated areas. Otherwise, immediates could be brought in with an extra
vertical bus just on the left-hand side of the ALU. Aligning the immediate to
the LS or MS word-position could then be done with a 2-way multiplexor at the
input of BI. This solution is feasible, but it was not chosen because it
requires the extra space for a 19-bit vertical bus. The horizontal length of the
data-path is severely limited by the desire to have 138 registers, in a chip
limited in length by the size of the package cavity.

Another trade-off relates to the way the ALU inputs are fed. In the chosen
scheme, multiplexing the ALU sources occurs on the busses which feed AI and
BI. In this way, the ALU inputs are simple latches, and the register-file busses
A and B do not have to extend all the way up to the ALU inputs. This latter fact
is advantageous because it reduces bus capacitance, thereby speeding up
register-reads. It also alleviates the heavy bus congestion in the SRC area. The
alternative would be to make the ALU inputs into latches with multiplexors and
to extend various busses all the way to them. This scheme would allow registers
to be routed directly to the ALU without incurring the extra delay due to the
forwarding from one bus onto another. However, it has all the disadvantages
corresponding to the advantages of the chosen scheme.


4.2  The  RISC  II  Timing. 

This section is concerned with the timing of the RISC II data-path. It starts
with a discussion of the fundamental timing dependencies, as implied by the
instruction set and the pipeline scheme, regardless of a specific data-path.
Then, it proceeds to examine how this timing was cast into specific clock phases
for the particular data-path that was chosen (sect. 4.1). Finally, the timing
picture that emerged after the data-path was laid out is presented.
Discrepancies between the three above timing schemes are discussed and
explained, and some conclusions are drawn.

4.2.1  Fundamental Timing Dependencies.

Figure 4.2.1 is an abstract timing-dependency graph for the RISC II pipeline
(sect. 3.3). Arrows represent data-path activities, while vertices represent
cause-effect dependencies. If an activity Y depends on an activity X and must
follow it in time, then the arrow representing Y starts from the endpoint of
arrow X.

The  diagram  shown  in  fig.  4.2.1  assumes  no  knowledge  about  the  data-path, 
other  than  the  use  of  a  register  file  with  two  read-ports,  one  write-port,  and 
requiring  a  precharge  -  read  -  write  cycle.  One  counter-clockwise  revolution 
around  the  top  half  of  the  diagram  represents  the  main  activities  occurring 
inside  the  CPU  during  one  machine  cycle.  Equivalently,  one  clockwise  revolution 
around  the  bottom  half  represents  the  memory  cycle  occurring  in  parallel. 

Point A illustrates that an ALU or Shift operation can only begin after its
source-registers have been read, and that a register-write can only begin after
the read operation has been completed. Point B shows that the result of an ALU
operation can be used as an effective memory address for a data access or for
an instruction fetch. Points C, D, E illustrate a memory read. When this is an
instruction fetch, then the path E→G shows that the source-register-number
fields of the instruction must be decoded before the corresponding register read
accesses can start. The path E→F stands for the alignment and
sign-extension/zero-filling needed when bytes and short-words are loaded from an
arbitrary memory location into the least-significant position of a register.
Thus, point F represents the result of the second-to-last pipe-stage, which is
to be written into the destination register during the last pipe-stage (point
A). The precharge - read - write register-cycle is shown as cycle H→G→A→H.

Figure 4.2.1: Fundamental Timing Dependencies in RISC II.

Figure 4.2.1 has been drawn with some crude notion of actual time durations
in it. The length of the arrows is roughly proportional to the delay of the
corresponding data-path activities. Not included is an estimate of the routing
delays from one functional unit to another, because no specific data-path is
assumed at this point. The diagram shows points E and B separated by an
arbitrary amount of time t_1. Point E represents the end of a memory-read cycle,
and point B represents the beginning of the next memory cycle. Because of the
multiplexed address/data pins and because of the non-overlapped memory
accesses, point B must be after point E, and thus t_1 must be > 0. Within that
constraint, t_1 is an arbitrary design parameter that specifies how much the
memory access time (B→C→D→E) is shorter than the overall system cycle time.

The diagram shows quite clearly that the internal-forwarding path, F→A, is
not critical. The critical paths lie in the register file loop:

(\text{precharge})_{HG} \rightarrow (\text{read})_{GA} \rightarrow (\text{write})_{AH}

and the figure-8-shaped path:

(\text{decode-register})_{EG} \rightarrow (\text{read-register})_{GA} \rightarrow (\text{compute-address})_{AB} \rightarrow

\rightarrow (\text{send-it-off-chip})_{BC} \rightarrow (\text{fetch-instruction})_{CD} \rightarrow (\text{bring-it-on-chip})_{DE}

Thus, using T_{cycle} for the cycle time, we derive the basic critical path
equations:

T_{cycle} \ge (\text{reg-prech.})_{HG} + (\text{reg-read})_{GA} + (\text{reg-write})_{AH}

T_{cycle} \ge (\text{reg-decode})_{EG} + (\text{reg-read})_{GA} + (\text{ALU-add})_{AB} - t_1

T_{cycle} \ge (\text{pins-out})_{BC} + (\text{mem-read})_{CD} + (\text{pins-in})_{DE} + t_1

Thus, the parameter t_1 represents a trade-off between memory and CPU speed.
The faster the memory-access time is, the larger t_1 becomes, and the slower the
register-decoding and reading and the ALU can be.
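The three critical-path inequalities can be evaluated numerically to see how
t_1 trades memory speed against CPU speed. The sketch below is only a worked
illustration: the function names are hypothetical and the delay values are
invented for the example, not measurements from the RISC II chip.

```python
def min_cycle_time(d, t1):
    """Smallest cycle time satisfying the three critical-path inequalities.

    d maps each arc of the dependency graph (e.g. "HG") to its delay in nsec.
    """
    if t1 <= 0:
        raise ValueError("t1 must be > 0 (point B must follow point E)")
    reg_loop = d["HG"] + d["GA"] + d["AH"]       # precharge, read, write
    cpu_path = d["EG"] + d["GA"] + d["AB"] - t1  # decode, read, ALU add
    mem_path = d["BC"] + d["CD"] + d["DE"] + t1  # pins out, memory, pins in
    return max(reg_loop, cpu_path, mem_path)


def balanced_t1(d):
    """Value of t1 at which the CPU bound and the memory bound coincide."""
    return (d["EG"] + d["GA"] + d["AB"] - d["BC"] - d["CD"] - d["DE"]) / 2


# Invented delays (nsec), for illustration only.
delays = {"HG": 60, "GA": 100, "AH": 80, "EG": 60,
          "AB": 120, "BC": 40, "CD": 150, "DE": 40}

t1_star = balanced_t1(delays)
best = min_cycle_time(delays, t1_star)
```

With these invented numbers the two bounds meet at t_1 = 25 nsec, giving a
255 nsec cycle; a smaller t_1 makes the CPU path the limit, and a larger one
makes the memory path the limit.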

4.2.2  The  RISC  II  4-Phase  Timing. 

Mapping a timing dependency diagram, like the one of figure 4.2.1, into
concrete clock phases for an actual data-path, requires time-area trade-offs to
be considered and compromises to be made. For RISC II, a 4-phase clock was
chosen, for the following reasons:

•  symmetric clock phases are easy to generate;

•  register file operation is non-ideal, due to the high resistivity of the
polysilicon word lines;

•  the register address decoders are simplified;

•  the shifter must be used twice per cycle.

Below, we discuss these points and present the utilization of the data-path
during the four non-overlapping clock phases. The choices described here were
made before the data-path was laid out and were based on estimates of the
various delays. The next subsection, 4.2.3, compares these estimates with the
timing picture that emerged after the lay-out was completed and circuits were
simulated with their actual parasitic capacitances.

The timing design started with the estimate that a register read takes more
time than a register write, which, in turn, takes more time than precharging the
register-file busses. Instead of allocating three unequal clock phases for each
one of these operations, it was decided to use a 4-phase clock with two long
phases φ1 and φ3 (≈80 nsec each), interleaved with two short phases φ2 and φ4
(≈60 nsec each). This would make clock generation easier, since the generator
could now have a 2-phase period. Having more clock-phases, of a shorter
duration each, is also useful in fine-tuning the timing of the various
operations. On the other hand, 4 phases may result in wasted time during the
non-overlap periods between clock phases.

To match the register file requirements with the four defined clock phases,
register reads were planned to stretch over both φ1 and φ2, while phases 3 and 4
were allocated to the register-write and precharge operations, respectively. The
implications of the high resistivity of the polysilicon word lines were studied
carefully. The RC time constant of the address lines can easily reach 50 nsec,
causing significant delays between their near and far ends. If a write operation
immediately follows a read operation, the read-word-lines may not be fully
deactivated by the time writing begins, and may cause erroneous register file
write operations. To avoid this hazard, it was decided to activate the word-line
drivers for read accesses during φ1 only. By applying the read pulses to the
near end during φ1, these pulses stretch into φ2 for the bits at the far end of
the word-line.

When a static adder is used in the data-path, it is possible to mitigate the
effect of the above RC delay. By placing the least-significant bits of the
data-path at the end of the word-lines near the drivers, the adder operation
(carry-propagation) can start as soon as these bits are read, without waiting
for the most-significant bits from the far end. RISC II, however, uses a dynamic
ALU circuit with a precharged carry-chain. That chain can only be released after
all input bits have settled.

Decoding the register-addresses must be done before the corresponding
access begins, so that the word-lines remain stable once activated. The chosen
timing scheme has the advantage of allowing the register-file decoders to
operate during the phases when the word-line drivers are disabled, that is
during φ2 and φ4. Thus, no pipeline latches are required at their output, and
the congestion in that area is alleviated.

Figure 4.2.2 shows the RISC II timing graph, adapted to the four clock
phases and to the data-path of figure 4.1.1. Register reads occur during φ1 and
part of φ2; thus, the ALU-operation timing was defined to use the end of φ2 for
set-up, and φ3 and part of φ4 for carry propagation. Memory accesses start after
the ALU operation has been completed. The effective-address is sent off the chip
late in φ4 and during the next φ1. Data or instructions come back into the CPU
late in φ3, just in time for the instruction and source-register-numbers to be
decoded during φ4. For write memory-accesses, data are sent to memory during
φ2, following the address. This relatively late transfer of the write-data is
not a limiting factor for memory chips, because address decoding must occur
before the data are needed.

The  choice  of  using  the  shifter  twice  per  cycle,  for  operations  with  diverse 
timing  (sect.  4.1.3),  had  important  implications.  The  shifter  is  used  for 
shifts/alignments  and  for  bringing  the  immediates  into  the  data-path.  The  fact 
that  it  has  to  be  precharged  before  each  use  is  another  reason  why  four  phases 
per  cycle  are  necessary. 

An immediate constant must be routed through the shifter during φ2. It
cannot be routed during φ1, because the previous instruction may have been a
load that used the shifter during φ4 for aligning its data. The routing of the
immediate constant through the shifter may totally overlap the stretched
register read operation during φ2. In this case, no extra delays are introduced
because of this routing. This was our original estimate and an additional reason
why it was decided not to spend silicon area for an extra vertical bus for
immediates (sect. 4.1.3). However, this balance of delays of the two operations
is strongly dependent on the implementation. The routing of immediates may
easily extend beyond the read operation. This would introduce extra delays and
routing of the immediates would become the critical path leading to the ALU.

4.2.3  The  RISC  II  Timing,  Reconsidered  after  Lay-out. 

After laying out the chip, its performance was studied with circuit simulation
and analysis, based on the actual sizes and resulting parasitic capacitances
of the circuit elements. The critical portions of the data-path were simulated
with SPICE2 [NaPe73], using the "worst-case-speed" parameters shown in Table
[...]. The delays through the rest of the chip were estimated using the timing
verifier Crystal [Oust83], which utilizes a simple RC model. Figure 4.2.3 shows
some results of that study. Two major deviations from the delay values
originally expected made the actual timing noticeably different from the ideal
picture of Figure 4.2.1.


The actual length and capacitance of the word-lines are less than originally
anticipated. An accurate simulation showed that the "stretching" of the
register-read operation into φ2 was small. This puts the routing of the
immediate-field into the critical path (§ 4.2.2).

Figure 4.2.3: The RISC II Timing as Simulated after Layout.


Phases 2 and 4 have to be longer than 60 nsec because register-file decoding
is slower than expected. RISC II has 138 registers requiring 276 decoding gates
in its two-port overlapping-window register file. Minimizing the size and power
dissipation of these gates was crucial because of their large number, even
though it led to slower operation. This effect compounds the delay in the
decoding circuits due to the overlapping-window scheme (figure 3.2.1(a)). The
decoding gates for the non-overlap registers are 6-input NOR gates with a delay
of about 30 nsec. (Their low-power pull-up has W/L = 0.5, while it is pulling up
≈0.25 pF of load capacitance, consisting of 45λ² of gate capacitance, 200λ² of
drain-diffusion capacitance, and a 160-λ long polysilicon wire.) However, the
overlap registers require OR-AND-INVERT decoding gates, which have a delay of
about 70 nsec. If the circuit of figure 3.2.1(b) had been devised earlier,
decoding time for the overlap registers in RISC II could have been reduced to
about 40 to 45 nsec.

Another issue studied is the delay resulting from routing data from the
register-file across the shifter to the ALU. Driving busses D and R from busses A
and B takes approximately 20 nsec. The use of busses D and R to feed the ALU,
instead of extending A and B all the way to the latter, was chosen because it
leads to a much less congested layout in the area of SRC, and because it
simplifies ALU input multiplexing (sect. 4.1.3). Extending busses A and B all the
way to the ALU would have increased their capacitance and slowed down
register-reads by about 15 nsec.

4.2.4  Lessons  Learned. 

Here we present some insights gained during the design of the data-path.
They result from comparing the ideal RISC II timing (§ 4.2.1) with the originally
planned real timing (§ 4.2.2) and with the actual timing that finally resulted
(§ 4.2.3). It has become clear that loss in performance can be attributed to two
main reasons:

•  Not  enough  hardware  resources  were  allocated  to  frequent  operations. 

•  Too  many  hardware  resources  were  allocated  to  infrequent  operations. 

This is yet another expression of the RISC concept: Capabilities added to a
circuit in order to speed up some operation(s) will slow down other operations.
Thus, the only capabilities that should be added to a circuit are the ones that
speed up the most frequently used operations.


The fine balance of the delays on the word-lines and of routing the immediate
through the shifter, that was originally sought, was not achieved. The area
saved by passing the immediates through the shifter incurred significant
performance penalties. By spending extra hardware, we could have eliminated the
need for φ2, and thus significantly sped up the RISC II CPU. The extra hardware
is not trivial, but would have been worth spending:

•  An  extra  19-bit  vertical  bus  would  be  needed  on  the  left  side  of  the  ALU 
for  introducing  the  immediate  constants  into  the  data-path. 

•  A more complicated register-address decoder would be needed to overlap
the write-address decoding with the read operation.

•  Pull-down  transistors  would  be  needed  at  the  far  end  of  the  register 
word-lines  to  suppress  the  read-pulses  at  the  end  of  the  read  operation 
and  before  the  beginning  of  the  write  operation. 

In  general,  enough  specialized  resources  should  be  dedicated  to  the  key  CPU 
operations. 

On the other hand, the area occupied by the cross-bar shifter and by its
associated input latch/driver, SRC, is significant, and so is the bus congestion
caused by its busses R and L in the DST-SRC area. This introduces delays into
the frequently used path between ALU and the register-file, as discussed at the
end of the last subsection. The shifter could have been placed on the right-hand
side of the ALU, and immediates introduced into the data-path through a
separate bus. It would still consume precious space in the overloaded, critical
horizontal length of the data-path, but the data transfer delays would be
reduced. In most programs, shifting by an arbitrary amount occurs rarely
(chapter 2). Shifting by one or two bits, as required for multiplications and for
conversions of array indexes to byte addresses, could be performed in the ALU.
Our conclusion is that an arbitrary-amount shifter does not belong in the critical
part of the data-path; it could be included somewhere else, accessible only by
slower-executing instructions. For example, it could be placed near DIMM (fig.
4.1.1) where it would also be useful in aligning data from memory.


4.3  The  RISC  II  Control. 

The main consequence of a reduced instruction set is the dramatic reduction
of the silicon resources required for control. The RISC II opcode-decoder:

•  occupies only 0.5 % of the chip area,

•  has 0.7 % of the transistors,

•  required less than 2 % of our total design and layout time.

This opcode-decoder is the equivalent of the micro-program memory in
microcoded CPU's. The RISC II control section occupies only 10 % of the chip
area. These figures stand in sharp contrast to the usual size of control in
contemporary microprocessors [Fitz81] [Beye81]:

•  the M68000 control section is 68 % of the chip area;

•  the Z8000 control section is 53 % of the chip area;

•  the iAPX-432-01 control section is 85 % of the chip area;

•  the HP "Focus" 32-bit CPU has 78 % of its transistors in its microcode
ROM.

Although  the  organization  of  the  RISC  II  control  could  be  considered  as 
"random  logic",  it  only  required  half  a  person-year  of  design  and  layout  effort. 

4.3.1  Organization  of  the  RISC  II  Control. 

Figure 4.3.1 shows the organization of the RISC II control. Registers are
tagged with a number (1), (2), or (3) to indicate the pipeline stage to which the
information they hold belongs. Thus, the registers marked "(1)" hold the
information of the instruction being currently fetched, while those marked "(2)"
hold information relating to the currently executing instruction. This latter
information will flow into the "(3)-registers" during the next cycle, if necessary.
This flow of information among the pipeline registers freezes when the pipeline is
suspended for a data-memory-access.

Decoding  the  incoming  instruction  is  particularly  easy  in  RISC  II,  because  of 
the  simple  instruction  format  with  fields  of  fixed  size  and  position  (sect.  3.1.4). 
RISC  II  does  not  have  a  single  physically-integrated  instruction  register. 
Instead,  it  has  multiple  instruction-field  registers,  each  one  close  to  the  place 
where  it  is  needed. 


•  Some instruction fields may have multiple interpretations, depending on
the instruction. They pose no problem: copies of the same field are
latched at all places of possible use, and the unnecessary ones are
thrown away later on. Examples:

-  Fields <18:14>, <13>, and <4:0> may be parts of an immediate constant,
or they may be Rs1, IMMflag, and Rs2, respectively.

-  Field <22:19> may be part of Rd, or it may be the jump-condition.

•  One instruction field, that must be used as soon as the instruction arrives
into the CPU, may originate from two different places in the instruction,
depending on the opcode:

-  The second register to be read via busB is Rs2 (<4:0>) for most
instructions, but it is Rd (<23:19>) for store instructions.

This does pose a problem. The selection of the appropriate origin may not
wait until the opcode has been decoded through the normal decoder.
Instead, a special fast gate is used to distinguish store from non-store
instructions. The particular choice of opcodes allows this distinction to
be made very quickly, based on whether instr<30:29>=11.
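The field latching and the fast store test described above can be sketched
as follows. This is a software illustration, not the actual circuit. The
bit ranges <23:19>, <22:19>, <18:14>, <13>, <4:0> and <30:29> come from the
text; the opcode position <31:25>, the 13-bit immediate at <12:0> and the
19-bit immediate at <18:0> are assumptions made for the example, as are all
function names and the instruction words used.

```python
def bits(word, hi, lo):
    """Extract the bit field <hi:lo> from a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def latch_fields(instr):
    """Latch every possible interpretation of each field; unused ones are ignored."""
    return {
        "opcode": bits(instr, 31, 25),    # assumed position of the 7-bit opcode
        "rd": bits(instr, 23, 19),
        "jump_cond": bits(instr, 22, 19),
        "rs1": bits(instr, 18, 14),
        "imm_flag": bits(instr, 13, 13),
        "rs2": bits(instr, 4, 0),
        "imm13": bits(instr, 12, 0),      # assumed position
        "imm19": bits(instr, 18, 0),      # assumed position
    }


def is_store(instr):
    """Fast test that bypasses the normal opcode decoder."""
    return bits(instr, 30, 29) == 0b11


def busB_register(instr):
    """Register number read via busB: Rd for stores, Rs2 otherwise."""
    return bits(instr, 23, 19) if is_store(instr) else bits(instr, 4, 0)


# Illustrative words only; the opcode values are not real RISC II encodings.
store_like = (0b0110000 << 25) | (7 << 19) | (3 << 14) | (1 << 13) | 5
other = (0b0000001 << 25) | (7 << 19) | (3 << 14) | 9
```

For the first word the model selects register 7 (the Rd field) for busB,
and for the second it selects register 9 (the Rs2 field), without consulting
the full opcode.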


The register-numbers and the immediate-constant move through the pipeline
of instruction-field-registers and are used at the appropriate place and
time. Figure 4.3.1 shows their organization and should be compared with figure
4.1.1. The imm-field-register-(2) is DIMM in fig. 4.1.1. The register-number
field-registers-(1) are not shown in fig. 4.1.1, for simplicity. Figure 4.3.1
also shows the circuit which detects data-dependencies and initiates
internal-forwarding [...] stage-(2) for equality.


The 7-bit op-code, together with one bit of state information, is decoded to
generate 30 bits of decoded-opcode information. Op-codes are decoded once per
cycle, and the one bit of state serves to distinguish between a normal cycle and
a memory-data-access cycle during which the rest of the pipeline activity is
suspended.

Twenty-eight of the 30 decoded-opcode bits are used during the second
(computation) stage of the pipeline. Two others, together with SCCflag
(set-condition-codes flag), are used to control the activities during the last
stage of the pipeline, which is the only one that modifies the user-visible
state of the CPU. These three bits control writing into Rd, PSW, or the CC's.
On interrupts, they get cleared, thus effectively aborting the instruction that
was executing †.
Also,  on  interrupts,  the  op-code  latch  gets  loaded  with  a  special  hardwired 
instruction,  calli.  This  instruction  calls  the  interrupt  handler,  changes  the 
current  window  (sect.  3.2.2),  and  saves  LSTPC  into  Rzt  of  the  new  (free)  window. 
In  this  way,  the  interrupted  instruction  can  be  restarted.  More  information  on 
interrupts  can  be  found  in  Appendix  A. 

Besides  the  30  decoded-opcode  bits,  nine  more  bits  of  information  are 
involved  in  the  generation  of  the  control  signals.  They  are  (see  fig.  4.3.1): 


•  SCCflag:  set-condition-codes  flag  from  the  instruction. 

•  nniflag:  immediate  flag  from  the  instruction,  specifying  whether  short- 
SOURCE2  is  R, 2  or  an  immediate  (fig.  3.1.2). 

•  Match-Detect:  detection  of  data-dependencies  among  second  and  third 
pipeline  stages,  and  initiation  of  internal-forwarding  (§  3.3.1). 

•  JUHPcond:  the  result  of  evaluating  the  condition  for  a  conditional  jump. 

•  SRCsign:  the  sign-bit  of  the  SRC  latch,  used  to  control  sign-extension  dur¬ 
ing  shift  instructions. 

•  DIMMsign: the sign-bit of the DIMM latch, used to control sign-extension
of immediate operands, or during load instructions.

•  BAR:  the  Byte-Address-Register,  used  in  detecting  address  misalignments. 


The  control  signals  are  generated  by  ANDing  one  or  more  of  the  above  39 
bits  with  one  or  more  of  the  clock  phases  (as  in  polyphase  microcoded  imple¬ 
mentations).  RISC  II  uses  four  clock  phases,  as  discussed  in  sect.  4.2.2.  These 
are externally supplied, but the fourth one of them, φ4, is internally "split" into
two mutually exclusive phases, φ4 and φINT. Normally, φ4 is issued; however,
when an interrupt occurs, φINT replaces it and sets and clears all the crucial
bits, as discussed above.

† An interrupt that occurs during the memory-cycle of a store instruction will not prevent
the memory-write from occurring, if it may occur (i.e. if a page-fault was not the interrupt
cause). The same store instruction will be re-executed when the interrupt-handler returns,
re-writing exactly the same data into exactly the same memory location. Notice that the
interrupt cause cannot have been an address-misalignment, since misalignment-interrupts
only occur during the address-calculation cycle.


There  are  100  control  signals,  half  of  them  for  the  data-path,  and  half  of 
them  for  the  control  section.  These  include  multiple  copies  for  local  uses,  clock 
phases  with  no  control  qualification,  and  decoded-opcode  bits  with  no  clock 
qualification.  Most  of  the  100  timing  gates  which  generate  them  are  very  simple: 

•  87 timing gates have not more than one clock input;

•  70 timing gates have not more than one clock input and not more than
one control-bit input.

Only  18  of  the  control  signals  depend  on  control  bits  other  than  the  decoded- 
opcode  bits. 

Thus, the organization of the RISC II control is simple and straightforward.
There is a finite-state-machine with 7 inputs, 30 outputs, and only 2 states. Its
outputs are combined with 4 clock phases to generate the control signals.
Because not all timing gates are simple 2-input AND gates, and because some
control bits are generated outside the FSM, one may find elements of "random
logic" organization in the RISC II control. However, that is not the issue. The
issue is that the control section has a straightforward organization, that it is
easy to understand, and that it required only six person-months of design and
layout effort.
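
The idea of a timing gate, a control bit ANDed with a clock phase, can be illustrated with a minimal Python sketch. The signal names (`write_rd`, `load_dst`), the decoded-bit names, and the phase assignments below are hypothetical and are not taken from the RISC II design; the sketch only shows how a control signal becomes active when its decoded bit is set and its hardwired phase occurs, and how clearing a bit on an interrupt suppresses the signal.

```python
# Sketch of "timing gates": each control signal is a decoded-opcode bit
# ANDed with one clock phase. All names and phase numbers are hypothetical.

# control signal -> (decoded bit it depends on, clock phase it is gated with)
TIMING_GATES = {
    "write_rd": ("dest_write_enable", 4),
    "load_dst": ("alu_result_valid", 3),
}

def control_signals(decoded_bits, phase):
    """Return the state of every control signal during the given clock phase."""
    return {
        signal: bool(decoded_bits.get(bit, False)) and phase == gate_phase
        for signal, (bit, gate_phase) in TIMING_GATES.items()
    }

def clear_on_interrupt(decoded_bits, bits_to_clear):
    """Model the interrupt case: cleared bits can no longer enable their gates."""
    return {k: (False if k in bits_to_clear else v) for k, v in decoded_bits.items()}

bits = {"dest_write_enable": True, "alu_result_valid": True}
print(control_signals(bits, phase=4))

aborted = clear_on_interrupt(bits, {"dest_write_enable"})
print(control_signals(aborted, phase=4))
```

Because the phase is part of the gate and not of the decoded information, the decoder only has to say whether a signal is needed, not when.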

4.3.3  Simplicity  of  the  RISC  II  Control. 

In  RISC  II,  a  few  dozen  bits  of  information  are  enough  to  control  the  execu¬ 
tion  of  an  instruction.  This  number  stands  in  sharp  contrast  to  the  much  larger 
number  of  microprogram  bits  required  to  execute  each  instruction  in  typical 
microprogrammed machines. We attribute this reduction of required information
to the uniformity of the execution of the RISC instructions. All RISC instructions
follow the pattern: read-sources, operate, appropriately-route-the-result,
and they follow it with the same fixed timing. Thus, the only information which
the  instruction-decoder  needs  to  generate  is  whether  or  not  a  certain  control 
signal  must  be  activated  during  the  execution  of  the  current  instruction.  The 
particular  time  during  which  the  control  signal  might  be  activated  is  known  in 
advance  and  hardwired  into  the  gate  that  drives  it. 

An  important  characteristic  of  the  instruction-decoder  is  its  simplicity  as  a 
combinational  circuit.  It  decodes  8  inputs  which  can  be  in  one  of  56  relevant 
states: 23 single-cycle instructions, 16 two-cycle instructions, or illegal
(unassigned) op-codes. It has 36 distinct product terms and 30 outputs. The average
number of product terms participating in the generation of an output is
1.47. Thus, if it were to be implemented using a general PLA, the OR plane would
be very sparse: 36×30 crosspoints containing only 44 transistors. For that reason,
we used a generalized decoder with a single row of OR gates, instead. This
implementation consumes about 60 % less area than the PLA implementation,
and it is significantly faster.
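
The sparsity figures quoted above can be checked with a short calculation. The sketch below uses only the numbers given in the text and assumes that each OR-plane transistor corresponds to one product term feeding one output; the variable names are illustrative.

```python
# Arithmetic behind the sparse OR plane, using the figures quoted in the text.
product_terms = 36
outputs = 30
or_plane_transistors = 44   # assumed: one transistor per (term, output) connection

crosspoints = product_terms * outputs                    # size of the OR plane grid
avg_terms_per_output = or_plane_transistors / outputs    # average fan-in per output
occupancy = or_plane_transistors / crosspoints           # fraction of crosspoints used

print(crosspoints)
print(round(avg_terms_per_output, 2))
print(round(100 * occupancy, 1))
```

Under that assumption only about 4 % of the crosspoints of a general PLA would be populated, which is the motivation for the single row of OR gates.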

We  attribute  the  low  number  of  product  terms  per  decoded-instruction  bit 
to  the  fact  that  the  instruction  set  is  highly  orthogonal,  and  that  orthogonality 
maps  into  the  op-codes  and  into  the  data-path.  The  low  density  of  assigned 
opcodes,  using  only  39  legal  opcodes  out  of  64  (see  Appendix  A),  also  had  a  posi¬ 
tive  effect.  Consider  the  following  examples  (x's  mean  "don’t  care"): 


Class of      Instructions:              Hardware function
opcodes:                                 controlled:

01xxxxx       load/store                 two-cycle instruction: set state-bit
                                         to pipeline-suspension-cycle

01xxxx1       PC-relative load/store     place PC onto busD on φ[?];
0001x01       PC-relative jump/call      do NOT place busA onto busD on [...]

0101x1x       signed load                sign-extender/zero-filler

00011xx       conditional control        together with JUMPcond, determines
              transf. (jmp/ret)          whether to place INC or the ALU output
                                         onto busOUT on [...]


Each one of these classes of instructions is decoded using a single product term.
The selection of the 39 opcodes within a 2^6 opcode space was done with the purpose
of minimizing the size of the decoder; it required only 2 person-hours of
effort. The selection of the AND and OR terms of the decoder was done in 2
further person-hours. These numbers demonstrate the simplicity of the RISC II
control.
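
To make the notion of "one product term per class" concrete, here is a minimal Python sketch that treats each pattern of the table as a mask/value pair, with x bits ignored. The five patterns are those of the table above; the two opcodes tried at the end are arbitrary bit patterns chosen to fit them and are not quoted from Appendix A. The function names are illustrative only.

```python
# Each opcode class is one product term: a 7-bit pattern with don't-cares.
CLASSES = {
    "01xxxxx": "load/store",
    "01xxxx1": "PC-relative load/store",
    "0001x01": "PC-relative jump/call",
    "0101x1x": "signed load",
    "00011xx": "conditional control transfer",
}

def product_term(pattern):
    """Compile a pattern into (mask, value); 'x' positions are ignored."""
    mask = int("".join("0" if c == "x" else "1" for c in pattern), 2)
    value = int("".join("0" if c == "x" else c for c in pattern), 2)
    return mask, value

def matching_classes(opcode):
    """Return the classes whose single product term is asserted for opcode."""
    hits = []
    for pattern, name in CLASSES.items():
        mask, value = product_term(pattern)
        if (opcode & mask) == value:
            hits.append(name)
    return hits

print(matching_classes(0b0101011))
print(matching_classes(0b0001101))
```

A single opcode may assert several product terms at once, which is how a few terms can drive many decoded-opcode bits.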

One  may  consider  a  circuit  with  such  a  small  and  simple  control  section  as 
a  "hardware  engine"  rather  than  a  High-Level-Language  Computer.  However,  as 
section  3.4  has  shown,  this  circuit  executes  compiled  High-Level-Language  pro¬ 
grams  faster  than  several  popular  commercial  processors.  Its  compiler  and 
optimizer  were  easily  written,  and  the  machine  code  is  no  more  than  50  %  larger 
than  that  of  other  processors. 

4.4  Design Metrics of RISC II.


Table  4.4.1  presents  several  design  metrics  for  RISC  II.  It  gives  the  total 
absolute values for the chip area, number of transistors, power consumption,
number  of  rectangles,  and  approximate  design  and  layout  time,  as  well  as  the 
percentages  of  these  values  attributed  to  various  CPU  sub-functions. 


Table 4.4.1:  RISC II Design Metrics.

                        %       %         %        % Rectangles      Regu-     % Design time
Part                    Area    Tran-     Power    Drawn    Inst.    larity    Design   Layout
                                sistors   (worst
                                          case)

Data-Path: (tot.)       50.     92.6      57.      23.5     90.0     74.       13.      10.
Control: (tot.)         10.      5.7      13.      54.4      5.8      2.1       7.      10.
Periphery: (tot.)       40.      1.7      30.      22.1      4.2      3.5       2.4      3.6

%                      100.0   100.0     100.0    100.0    100.0     19.6      60.      40.

Total CPU Abs. Value    14.9    40.76      1.9     23.5    460.                5250
                        Mλ²     K         Watts    K        K                  man-hours

[The per-sub-function entries of this table are not legible in this copy. The
rows affected are, for the data-path: Register File (storage array, decoders),
ALU, Shifter (cross-bar array, inp. latch/dr. decoder), PC's, Other
MUX/latch/drivers, Power wiring; for the control: Opcode Decoder,
Instr. & Control Registers, CC's, Jmp Cond., Interr., Window Number, Timing
Gates/Drivers, Wiring (non-power), Power wiring; for the periphery: Bonding
Pads, Wiring (non-power), Power wiring, Unused area (logo's); and, under
design time only: Micro-Archit. Design, Debugging / Verification, Document. &
Overhead.]


The register-number decoders and shift-amount decoder were included in
the data-path. The control section is subdivided into opcode-decoder,
instruction-and-control-registers, and timing-gates (see Figure 4.3.1), as well as
other specialized circuits (condition codes, jump condition, interrupt logic, window
numbers) and wiring. Areas are separately given for the power wiring
(ground and +5V) in the data-path and control sections. For the rest of the
metrics, the power wiring of the data-path and of the control is included in their
various sub-blocks, from which it was difficult to separate.

The numbers given for power dissipation use worst-case-power parameters,
which are different from those shown in table 3.1: V_TE0 = 0.7 V, V_TD0 = -3.8 V,
lateral-diffusion = 0.7 μm, k' = 30.7 μA/V².

The layout tool used was Caesar [Oust81], which allows only rectangles with
horizontal and vertical edges. We are convinced that the area penalty paid for
this restriction was minimal †, and that it was well worth the resulting
simplifications  in  the  layout  task  and  in  our  CAD  tools.  The  number  of  "drawn 
rectangles"  counts  the  rectangles  explicitly  specified  in  the  Caesar  data  base. 
This  number  exaggerates  the  number  of  rectangles  actually  placed  by  the 
designer.  It  counts  a  slightly  modified  copy  of  a  cell  as  a  totally  different  cell,  as 
is  the  case  with  most  timing-gates.  The  number  of  "instantiated  rectangles" 
counts  all  geometry  after  arrays  and  calls  have  been  expanded. 

Design  and  layout  times  are  approximate.  The  totals  for  data-path  and  con¬ 
trol  are  higher  than  the  sum  of  the  parts,  because  they  include  some  general, 
organizational  work.  The  elapsed  time  was  half  a  year  (times  one  person)  for  the 
micro-architecture  design,  plus  two  years  (times  two  persons)  for  everything 
else.  The  total  of  5250  man-hours  given  corresponds  to  2.7  man-years,  and  is 
lower  than  the  real  elapsed  time  because  of  other  activities  occurring  in  parallel 
(courses,  other  research).  It  does  not  include  work  performed  after  the  chip 
was  submitted  for  fabrication  (i.e.  more  documentation  and  testing). 
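
The relation between the effort figures quoted above can be cross-checked with a few lines of arithmetic. The sketch uses only the numbers stated in the text; the hours-per-man-year value is derived from them and is not given in the original.

```python
# Cross-check of the effort figures quoted in the text.
total_hours = 5250
man_years = 2.7

# Implied length of a man-year in hours (derived, not stated in the text).
hours_per_man_year = total_hours / man_years

# Elapsed effort: half a year times one person, plus two years times two persons.
elapsed_person_years = 0.5 * 1 + 2 * 2

# Share of the elapsed time that the accounted 2.7 man-years represent.
fraction_on_chip = man_years / elapsed_person_years

print(round(hours_per_man_year))
print(elapsed_person_years)
print(round(100 * fraction_on_chip))
```

The accounted effort is thus roughly 60 % of the elapsed person-years, consistent with the remark that other activities were occurring in parallel.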

In section 4.3 the size of the RISC II control section was compared to that of
other micro-processors. Here, we compare the number of transistors, the regularity,
and the design and layout effort for the whole chip (data from [Fitz81]):

† The register-cell is limited by fundamental line widths in both directions, and could not be
smaller with inclined lines. The shifter's width could be reduced by 16 % (= 0.3 % of the chip
width) using 45° lines [SKPS82].

CPU:           Transistors   Regularity   Design            Layout
               (K)                        (person-months)   (person-months)

RISC I             44            22           15                12
RISC II            41            20           18                12
M68000             68            12          100                70
Z8000              18             5           60                70
iAPX-432-01       110             8          170                90


When these numbers are combined with the performance comparisons of
sect. 3.4.3, the advantages of the Reduced Instruction Set approach become evident.

CHAPTER 5:

DEBUGGING AND TESTING RISC II.

This chapter describes the method used and the experience gained in the
process of debugging the RISC II logic design and layout and in the functional
testing of the RISC II chips. The fact that RISC II chips worked correctly on first
silicon is a result of both the simple architecture and the effectiveness of the
CAD environment that was used. The fact that RISC II chips were easily tested,
without the need to use the scan-in/scan-out loops, is again due to the simple
architecture with a readily accessible CPU state.

In this chapter we deal only with logic debugging and functional testing.
Timing analysis of the critical data-path and control circuits was done with
SPICE2 [NaPe73]. To check that the timing constraints were met, the timing
verifier Crystal [Oust83] was used after the whole chip was laid-out. Geometrical
layout rules were checked with Lyra [ArOu82]. For further discussions on
circuit simulation and timing see Sherburne's thesis [Sher83].

5.1  Logic Debugging Tools and Methods.


The  sophisticated  simulation  tools  available  in  today's  CAD  environments 
make  it  feasible  to  debug  a  VLSI  design  almost  completely  before  fabrication. 
Such  debugging  is  desirable  for  several  reasons. 

Software  simulation  has  a  faster  turn-around  time  and  lower  cost  than  pro¬ 
totype  fabrication,  and  this  is  likely  to  stay  true  in  the  next  years.  Even  if  this 
situation  should  be  reversed  some  day,  due  to  major  advances  in  the  implemen¬ 
tation  of  IC  chips,  software  simulation  can  typically  not  be  avoided.  If  a  proto¬ 
type  returned  from  fabrication  does  not  perform  as  expected  when  plugged  into 
the  system  or  test  set-up,  how  can  one  find  out  why  it  might  not  work  properly? 
Unless  some  revolutionary  new  method  of  hardware  testing  is  discovered,  our 
capability  to  monitor  or  modify  the  value  of  internal  nodes  in  a  VLSI  chip  is  very 
limited.  Mechanical  probing  is  heavily  constrained  in  terms  of  number  and  size 
of  nodes,  and  because  of  the  capacitive  loading  introduced  into  the  circuit. 
Scanning  electron  microscope  methods  cannot  simultaneously  monitor  many 
fast-changing  nodes.  Software  simulation,  on  the  contrary,  offers  the  capability 
of monitoring and changing the values of internal nodes. This is crucial in complex
VLSI systems with limited controllability/observability. Simulation is thus
often  the  only  way  to  gain  understanding  of  the  causes  of  malfunctioning  cir¬ 
cuits. 

Hierarchical  design  and  multi-level  simulation  tools  make  it  possible  to 
debug  a  VLSI  design  at  a  level  of  abstraction  which  is  higher  than  transistors  and 
capacitors.  This  makes  possible  a  properly  structured  design  employing  early 
debugging,  before  bad  assumptions  or  errors  lead  to  poor  design  decisions.  For 
complicated systems, this hierarchical approach also makes the task of debugging
manageable by checking each block at the proper level of abstraction.




The  RISC  II  design  was  done  at  four  levels: 

•  architecture  (ISP), 

•  micro-architecture  (RTL), 

•  gates  (logic),  and 

•  layout (circuit & mask geometry).

The architecture level corresponds to the system specification. The RISC I
& II architecture was described using the ISPS notation [Corc80] [BaMa79]. The
corresponding simulator was used to test the architecture, but it was too slow to
run large programs. A "special purpose" RISC simulator program was written
[Tami81] and was used in conjunction with the RISC compiler and assembler for
evaluating the RISC performance.

The  micro-architecture  level  corresponds  to  a  register-transfer  description, 
roughly  like  the  one  given  in  chapter  4.  The  Slang  language  and  simulator 
[VDFo82]  were  used  for  this  level,  as  described  below  in  §  5.1.1. 

At  the  gate  level,  we  only  used  diagrams  on  paper.  Because  that  level  is 
quite  close  to  the  final  layout,  the  lack  of  machine-readable  description  was  not 
important. 


At  the  layout  level,  Esim  [TermES]  was  used  for  switch-level  simulation  of 
the  circuit  that  was  extracted  from  the  layout,  as  described  below  in  §  5.1.2  and 
§5.1.3. 

5.1.1  SLANG:  Simulation  and  Debugging  at  the  RTL  Level. 

Slang is a LISP-based hardware description language and event-driven simulator
[VDFo82] that is suitable for describing and simulating digital systems
at mixed levels of abstraction. We did use it in such a mixed-mode description
and simulation:


•  Gate-level:  Some  parts,  such  as  the  timing-gates  in  the  control  section, 
were  described  at  the  gate-level. 

•  Gate-Vector-level:  Some  latches  in  the  data-path,  whose  proper  internal 
operation  needed  verification,  were  described  as  two  cross-coupled  32-bit 
vectors.  Bitwise  boolean  operations  on  32-bit  integers  were  used  for  that 
purpose. 

•  Register-level:  Other  latches  were  described  at  the  register-level,  by 
using  assignment  operations  on  32-bit  integers. 

•  Block-level:  Parts  of  the  system  were  described  as  a  block,  using  a  LISP 
program.  Such  was  the  case  for  the  off-chip  memory,  the  register-file  and 
shifter  plus  their  decoders,  the  opcode  decoder,  the  interrupt  logic,  and 
the  jump-condition  PLA. 

•  Real-Polarity-level:  The  data-path  busses  were  described  using  their  real 
polarities.  This  was  done  in  order  to  verify  the  correctness  of  the  design, 
since  some  of  them  are  used  with  different  polarities  at  different  times. 

•  Symbolic-Polarity-level:  For  other  values,  no  actual  polarity  was 
specified;  they  were  just  simulated  as  ON  or  OFF. 

•  Symbolic-Value-level:  Some  of  the  node-values  were  symbolic  constants 
or  lists  of  objects.  Such  was  the  case  for  opcodes  and  instructions,  which 
were  both  described  at  the  assembly  level.  This  mnemonic  representa¬ 
tion  was  very  helpful.  It  was  easy  to  implement  because  of  the  LISP 
environment. 


Difficulties  were  encountered  in  describing  and  simulating  bi-directional 
pass-transistors,  like  the  ones  in  the  register-cell  and  in  the  shifter.  To  solve 
this  problem,  Slang  was  modified  to  permit  multiple  sources  driving  the  same 
node,  provided  that  one  of  them  is  characterized  as  "stronger”  than  the  others. 
Then,  the  bi-directional  transistors  were  modeled  as  two  simultaneous  connec¬ 
tions  in  opposite  directions,  with  each  one  transmitting  a  value  in  its  own  direc¬ 
tion.  This  has  the  undesirable  side-effect  of  creating  a  feedback  loop  in  the 
simulated  circuit,  which  is  not  present  in  the  real  circuit.  In  our  case,  the 
storage  effect  that  the  feedback  loop  gives  did  not  bother  us,  because  the 
corresponding real busses do exhibit dynamic capacitive storage, and because
external  sources,  which  are  "stronger"  than  the  pass-transistors,  can  break  the 
loop.  However,  this  fictitious  loop  effect  may  be  a  real  problem  when  this  same 
simulation  technique  is  applied  to  other  systems. 
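
The strength-based resolution and the two-way model of a pass-transistor can be sketched as follows. This is not Slang code (Slang is LISP-based and its interface is not shown here); the function names, the two strength levels, and the string used for the unknown value are all assumptions made for illustration.

```python
# Illustrative model of a node with several drivers of different strengths,
# and of a pass-transistor treated as two one-way weak connections.
UNKNOWN = "unknown"
WEAK, STRONG = 1, 2

def resolve(drivers):
    """drivers: list of (value, strength). The strongest driver wins;
    disagreeing drivers of equal top strength leave the node unknown."""
    if not drivers:
        return UNKNOWN
    top = max(strength for _, strength in drivers)
    values = {value for value, strength in drivers if strength == top}
    return values.pop() if len(values) == 1 else UNKNOWN

def settle(a, b, external_a=None):
    """One evaluation of nodes a and b joined by a conducting pass-transistor,
    modelled as two WEAK connections (b -> a and a -> b)."""
    a_drivers = [(b, WEAK)]
    if external_a is not None:
        a_drivers.append((external_a, STRONG))
    a = resolve(a_drivers)
    b = resolve([(a, WEAK)])
    return a, b

loaded = settle(UNKNOWN, UNKNOWN, external_a=1)   # strong source sets both nodes
held = settle(*loaded)                            # source removed: the loop holds the value
flipped = settle(*held, external_a=0)             # a stronger source breaks the loop
print(loaded, held, flipped)
```

The middle step shows the fictitious storage effect of the feedback loop: with no external source, the two weak connections keep re-driving each other with the old value.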

Slang allows nodes to have an "unknown" value and uses this value during
initialization and when conflicts of multiple sources arise. The technique was
very useful in making sure that the RISC II chip can be initialized using the
external pins alone. A minor problem had to be overcome in a few cases. When
only some of the bits of a word are unknown, Slang will assign the value unknown
to the whole word. These words thus had to be split into multiple parts,
and the corresponding hardware into multiple nodes, leading to a more cumbersome
description.
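
The word-level behaviour of the unknown value, and the work-around of splitting a word into narrower nodes, can be illustrated with a small Python sketch. It is not Slang code; the encoding of "unknown" as `None` and the function names are assumptions chosen for the example.

```python
# Illustration: one unknown bit makes a whole word-level node unknown,
# unless the word is described as several narrower nodes.
UNKNOWN = None   # hypothetical stand-in for the simulator's "unknown" value

def word_value(bits):
    """Combine bits (most significant first) into an integer, or UNKNOWN
    if any single bit is unknown."""
    if any(b is UNKNOWN for b in bits):
        return UNKNOWN
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def split_word_value(bits, field_widths):
    """Model the work-around: evaluate the word as separate fields."""
    fields, start = [], 0
    for width in field_widths:
        fields.append(word_value(bits[start:start + width]))
        start += width
    return fields

bits = [1, 0, UNKNOWN, 1, 0, 1, 1, 0]
print(word_value(bits))                # the whole word is unknown
print(split_word_value(bits, [4, 4]))  # only the upper field is unknown
```

Splitting keeps the known half of the word usable, at the price of describing one piece of hardware as two nodes.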

Slang  should  normally  be  used  during  the  micro-architecture  design, 
before  layout  work  begins.  In  our  case,  however,  it  was  used  only  after  the 
data-path  was  laid-out.  No  errors  in  the  data-path  design  were  found,  owing 
probably  to  the  processor’s  simplicity.  On  the  other  hand,  5  to  10  design  errors 
were  uncovered  in  the  control  section  and  were  corrected  before  layout  began. 
Simulation  for  debugging  was  done  using  multiple  small  test-programs  as  input. 
The  correctness  of  the  design  was  checked  by  manually  looking  at  the  simula¬ 
tion  results.  This  manual  checking  was  not  very  time-consuming,  and  thus  no 
need  was  felt  to  use  the  architecture  simulator  for  automatic  result  checking. 
The  set  of  test  programs  was  believed  to  be  fairly  complete.  Nevertheless,  later 
during  the  switch-level  simulation  a  design  error  was  found  that  was  not  covered 
by  the  test-programs  (an  instruction  setting  the  carry-bit,  immediately  followed 
by  an  instruction  using  the  carry-bit).  This  shows  that  our  ad-hoc  approach  to 
test  generation  is  not  satisfactory. 

5.1.2  Node  Naming  and  Circuit  Extraction. 

Throughout  the  layout,  a  conscious  effort  was  made  to  flag  as  many  of  the 
nodes  in  every  cell  as  possible  with  names.  This  practice  turned  out  to  be  very 
useful  for  documentation  and  for  debugging.  About  half  a  dozen  layout  errors 
were  uncovered  early  in  the  debugging  phase  merely  through  the  analysis  of  the 
node-naming error diagnostics of the circuit extraction program. The preferred
names were the ones used for the corresponding nodes in the Slang description.
Good  and  consistent  naming  conventions  proved  to  be  important  but  were  not 
easy  to  devise  in  the  early  stages  of  the  design.  Polarity  information  was 
included  in  as  many  of  the  names  as  possible. 

We  used  Mextra  [FitzME]  to  extract  the  transistor  and  interconnection  list 
from  the  layout.  The  program  issues  two  types  of  diagnostic  messages: 

(1)  reports  of  the  discovery  of  two  different  names  on  the  same  electric 
node,  and 

(2)  reports  of  the  same  name  appearing  on  more  than  one  node. 

Some of those messages correspond to legal situations. Situation (1) may arise
when a node has multiple functions, and (2) may result from cells that are replicated
many times for different bits or functional units. Other messages
correspond to layout errors. Type-(1) messages indicate accidental short-circuits
or erroneous wiring of output A to input B and of output B to input A.
Type-(2) messages indicate missing connections if there are more node
instances with a certain name than there should normally be according to the
design. Mextra follows a certain naming convention that allows it to separate
type-(1) messages into legal and erroneous cases, and to report them
separately. However, no similar mechanism is available for messages of the
latter type.
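
The two kinds of naming diagnostics can be sketched in Python as a pair of groupings over (name, node) labels. This is only an illustration of the checks described above, not Mextra's implementation or output format; the label list and the function name are hypothetical.

```python
from collections import defaultdict

def naming_diagnostics(labels):
    """labels: iterable of (name, electrical_node_id) pairs.
    Returns (type1, type2):
      type1: node -> names, for nodes that carry two or more different names
      type2: name -> nodes, for names that appear on two or more nodes"""
    names_on_node = defaultdict(set)
    nodes_with_name = defaultdict(set)
    for name, node in labels:
        names_on_node[node].add(name)
        nodes_with_name[name].add(node)
    type1 = {n: sorted(s) for n, s in names_on_node.items() if len(s) > 1}
    type2 = {m: sorted(s) for m, s in nodes_with_name.items() if len(s) > 1}
    return type1, type2

# Hypothetical labels: node 1 carries two names (possible short-circuit),
# and the name "clk" sits on two unconnected nodes (possible missing wire).
labels = [("outA", 1), ("inB", 1), ("clk", 2), ("clk", 3), ("gnd", 4)]
type1, type2 = naming_diagnostics(labels)
print(type1)
print(type2)
```

Whether a reported case is legal or an error still has to be decided by the designer or by a naming convention, as discussed above.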

5.1.3  Co-Simulation  at  the  RTL  and  Extracted-Switch  Levels. 

The transistor and interconnection list that was extracted from the layout
was simulated with the switch-level simulator Esim [TermES]. A number of
unconventional  circuits  had  to  be  looked  at  carefully,  to  decide  whether  their 
simulation  would  present  any  problems.  If  necessary,  the  circuit-description  file 
was  hand-patched  to  overcome  such  problems: 


•  The  shifter  presented  no  problem,  because  Esim  knows  how  to  handle  bi¬ 
directional  transistors. 

•  The register cell itself presented no problem, but its interface to the
busses did. First, Esim would not detect the cell-disturbance that could
occur if a read-operation were attempted via a non-precharged bus; the
reason is that Esim believes that a static pull-up is always stronger than a
capacitive load. Second, Esim could not handle internal forwarding
correctly, because in the real chip that depends on the DST driver being
stronger than the register cell; Esim, on the other hand, always assumes
that a pull-down is stronger than a pull-up.

•  The  bootstrap-drivers  for  the  register  word-lines  could  be  simulated 
correctly,  with  the  only  exception  that  an  unknown  decoder  output  would 
cause  the  unknown  value  on  its  word-line  to  propagate  onto  the 
bootstrapping  clock.  The  reason  is  again  the  lack  of  understanding  that  a 
strong  pull-up  is  present. 

•  Some  static  latches  in  the  control  section  have  a  long  depletion  transistor 
which  is  used  as  a  feedback  resistor.  Esim  considers  all  depletion 
transistors  with  gate  and  source  tied  together  as  pull-ups  (!),  and  thus 
fails  to  simulate  those  latches  correctly. 


A  simulator  that  understands  that  some  transistors  are  stronger  than  others 
would  solve  the  above  problems,  provided  that  it  also  handles  depletion  transis¬ 
tors  correctly. 

Debugging  a  system  at  the  switch  level  is  very  difficult,  because  of  the  large 
number  of  nodes  that  have  to  be  watched,  and  because  the  correct  values  that 
these  nodes  should  have  are  not  always  obvious.  For  that  reason,  Slang  has 
been  written  in  such  a  way  that  it  can  execute  together  with  Esim.  Two  lists  of 
"corresponding  nodes"  are  defined  by  the  designer.  Slang  will  drive  the  Esim 
nodes  on  the  first  list  with  the  values  that  their  corresponding  nodes  have  in 
Slang,  then  perform  a  simulation  step  at  both  levels,  and  then  compare  the 
values that the "corresponding nodes" of the second list have in Slang and in
Esim,  and  print  any  discrepancies.  In  this  way,  the  results  of  the  switch-level 
simulation  are  automatically  checked  against  the  debugged  RTL  description. 
During  RISC  II  debugging,  the  values  of  1300  circuit  nodes  (single  bits)  were 
being  compared  for  equality  after  each  clock  phase  transition.  Besides  the  ease 
provided  by  the  automatic  checking,  the  method  is  also  very  helpful  in  identify¬ 
ing  the  cause  of  a  discrepancy  and  finding  the  offending  layout  error.  When 
checking is done automatically, the values of many nodes throughout the chip
can  be  checked,  and  thus,  the  first  discrepancy  reported  is  usually  very  close  to 
its  cause. 
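
The drive, step, and compare cycle of the co-simulation can be sketched with two toy simulators in Python. The `ToySim` class, the node names, and the inverter example are hypothetical stand-ins; neither Slang nor Esim exposes this interface, and the sketch only mirrors the order of operations described above.

```python
class ToySim:
    """Stand-in for a simulator: nodes are a dict, step() applies a function."""
    def __init__(self, step_fn, nodes):
        self.step_fn = step_fn
        self.nodes = dict(nodes)

    def step(self):
        self.nodes = self.step_fn(self.nodes)

def cosimulate(rtl, switch, driven, compared, steps):
    """Drive the `driven` switch-level nodes from the RTL model, step both
    models, then record mismatches on the `compared` node pairs."""
    discrepancies = []
    for t in range(steps):
        for rtl_name, sw_name in driven:
            switch.nodes[sw_name] = rtl.nodes[rtl_name]
        rtl.step()
        switch.step()
        for rtl_name, sw_name in compared:
            if rtl.nodes[rtl_name] != switch.nodes[sw_name]:
                discrepancies.append(
                    (t, rtl_name, rtl.nodes[rtl_name], switch.nodes[sw_name]))
    return discrepancies

def make_rtl():
    # RTL model of an inverter whose input toggles on every step.
    return ToySim(lambda n: {"in": 1 - n["in"], "out": 1 - n["in"]},
                  {"in": 0, "out": 1})

good_layout = ToySim(lambda n: {"a": n["a"], "y": 1 - n["a"]}, {"a": 0, "y": 1})
bad_layout = ToySim(lambda n: {"a": n["a"], "y": n["a"]}, {"a": 0, "y": 1})  # wrong polarity

pairs_in, pairs_out = [("in", "a")], [("out", "y")]
print(cosimulate(make_rtl(), good_layout, pairs_in, pairs_out, steps=2))
print(cosimulate(make_rtl(), bad_layout, pairs_in, pairs_out, steps=2))
```

The first discrepancy is reported in the very step in which the faulty connection first matters, which is what makes the offending layout error easy to locate.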

The  switch-level  simulation  uncovered  11  mistakes: 

•  one timing design error on a scan-in/scan-out loop (this was not
uncovered by Slang, because Slang does not know about these loops);

•  one  missing  flip-flop  design  error  (this  is  the  one  mentioned  at  the  end  of 
§  5.1.1,  that  the  Slang  test  programs  weren’t  testing  for); 

•  one  flip-flop  not  being  cleared  on  as  it  should; 

•  an  error  in  the  programming  of  a  decoder; 

•  one  connection  to  the  wrong  point; 

•  one  case  of  reversed  connections;  and 

•  five  cases  of  connections  to  the  wrong  polarity. 

This shows that most of the errors were cases of wrong connection. Mextra's
name-checking did not catch them either because the node naming was incomplete,
or because Mextra only looks at the names on electrically connected points
and does not check the consistency of input and output signal names on simple
gates.


5.2  Testing the RISC II Chips.


The RISC II layout was submitted for fabrication to MOSIS at λ=2μm, and to
XEROX PARC at λ=1.5μm. Twenty-eight chips were received back from MOSIS
two months later, and five chips were received from XEROX in one and a half
months. Five out of the 28 MOSIS chips were rejected by visual inspection, and 3
of the remaining 5 that were bonded and tested were found functionally correct.
The fastest one of them ran at a 500 nsec cycle-time. One out of the 5 XEROX
chips was found functionally correct, except for some bad bits in a few registers.
It ran at a 330 nsec cycle-time.

All  the  digital  IC’s  designed  in  our  group  at  U.C.Berkeley  during  the  last 
three  years  have  been  debugged  by  simulation  of  the  extracted  layout.  Our 
experience  has  invariably  been  that  chips  carefully  debugged  in  this  way  are 
functionally  correct  on  first  silicon.  This  was  true  for  all  of  the  following  big  pro¬ 
jects: 

•  RISC I,

•  FFT cordic rotator (Lioupis, Wold),

•  RISC II,

•  RISC Instruction Cache.

This demonstrates the viability and the effectiveness of this debugging method.

The  present  section  deals  with  the  functional  testing  of  the  RISC  II  CPU 
chips.  It  describes  the  hardware  set-up  and  the  testing  strategy  that  was  used, 
and  it  discusses  the  usefulness  of  scan-in/scan-out  loops. 

5.2.1  Testing Set-up and Strategy.

Figure  5.2.1  shows  the  set-up  that  was  used  for  testing  the  RISC  II  CPU 
chips. A Digital Analyzer (Tek DAS-9100) was used for pattern generation (PG),
for  data  acquisition,  and  for  comparing  the  acquired  to  the  expected  values.  We 
preferred  the  use  of  an  external  clock  generator,  rather  than  synthesizing  the 
clock  phases  with  the  pattern  generator.  It  reduces  the  number  of  different  pat¬ 
terns  that  need  to  be  generated,  since  the  chip  itself  only  needs  a  new  pattern 
every  4  phases.  The  external  clock  generator  also  allows  short  non-overlap 
periods and individually-variable clock phases. External tri-state buffers were
required, because the particular DAS that was used offered no tri-state PG chan-


•  hold the RESET pin high for one cycle (NXTPC ← 80000000 hex);

•  putpsw: PSW ← (initialize CWP, SWP, interrupts);

•  add: R_d ← R_0 + imm (initialize register R_d).

Of course, it would take about 150 cycles to initialize all 138 registers, but one
only needs to do that when testing for defects in the register-file. Once a few
registers have been loaded, instructions can be tested, and the result of any
instruction can be read from the pins in one cycle:

•  jump-indexed always to R_0+R_d,

where R_d is the destination register of the instruction to be tested. (Getpsw,
getlpc, and PC-relative instructions can be used to read the PSW or the PC's.)
Except for the PSW, the PC's, and the registers, all other storage devices in the
CPU are initialized and used within the execution cycle of any instruction that
uses them. Thus, they are directly controllable and observable by that instruction.

The test programs used were similar or identical to those used during
debugging with Slang (§ 5.1.1). The same comments apply here. The tests
performed are believed to be fairly complete, but there is no proof of that. Our
ad hoc approach to the problem of test generation is simply due to the lack of a
good theory and suitable CAD tools.

5.2.2 Scan-In/Scan-Out Loops.

Scan-in/scan-out (SISO) capability [WiAn73] [EiWi77] [FrSp81] is added to
chips in order to increase the controllability and observability of their internal
state from the pins. It can be implemented by organizing all latches as
shift-registers and by providing serial ports for reading and writing into them. In the
RISC II CPU, three latches, situated at central positions, have SISO capability:
DST, SRC, and the latch holding the output of the opcode-decoder. The first
two form a single 64-bit loop, while the last one forms a 32-bit loop by
itself. Each of these two loops uses separate dedicated pads and wires for its
shift-clock and serial-I/O. In this way, if the normal connections of the CPU with
the external world are defective, the loops may still provide access to the chip's
interior.

The  cost  of  the  SISO  loops  in  the  RISC  II  CPU  is  not  high:  Area-wise,  they 
consume  only  about  1%  of  the  data-path,  about  3%  of  the  control  section,  and 
10%  of  the  pads.  Speed-wise,  they  slow  the  machine  cycle  down  by  about  1%. 
However, the cost of using the loops for testing the chip is high. First, to load
values into them, or to read their contents, 32 or 64 cycles are required, as
opposed to the 1 to 4 cycles required for accessing the same latches with normal
instructions (§ 5.2.1). Second, reading/writing via the loops requires that
the  normal  CPU  clocks  be  stopped  and  that  the  SISO-shift  clocks  be  activated. 
Third,  the  whole  mode  of  operation  of  the  SISO  loops  is  much  different  from  the 
rest  of  the  chip,  thus  requiring  significant  human  effort  and  additional  software 
tools  for  their  use. 
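
The access cost of a serial loop can be illustrated with a small model. The sketch below is not taken from the RISC II design: the class name ScanLoop and its methods are hypothetical, and the only assumption is that one bit moves along the loop per shift-clock cycle.

```python
class ScanLoop:
    """Toy model of a scan-in/scan-out loop: latches chained as a shift register."""

    def __init__(self, length):
        self.bits = [0] * length   # bits[0] is next to the serial input,
                                   # bits[-1] is next to the serial output
        self.shift_cycles = 0      # shift-clock cycles spent so far

    def shift(self, bit_in):
        """One shift-clock cycle: one bit enters, one bit leaves."""
        bit_out = self.bits[-1]
        self.bits = [bit_in] + self.bits[:-1]
        self.shift_cycles += 1
        return bit_out

    def load(self, value):
        """Scan a new value in (MSB first) and return the old value scanned out."""
        n = len(self.bits)
        old = 0
        for i in range(n):
            out = self.shift((value >> (n - 1 - i)) & 1)
            old = (old << 1) | out
        return old


loop = ScanLoop(64)                      # e.g. a 64-bit loop such as DST + SRC
loop.load(0xDEADBEEF00000001)            # writing costs 64 shift cycles
previous = loop.load(0)                  # reading back costs another 64
print(hex(previous), loop.shift_cycles)
```

In this model every read or write of the loop costs as many shift cycles as the loop has bits, which is the 32- or 64-cycle penalty mentioned above.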

There  are  several  situations  in  which  chips  can  be  put  in  a  test  set-up: 

(1)  for  debugging  a  design; 

(2) for debugging a fabrication process, and finding out what particular
circuit is not working in a series of defective dies; or

(3)  for  identifying  operational  dies  for  packaging  and  use. 

In our view, number (1) above should be used only to verify that the design is
correct; debugging should be done in software, as discussed earlier in this
chapter. Number (2) above can be done with dies specifically designed for that
purpose; it does not need to be done with production chips. Thus, we believe
that the usefulness of SISO loops should only be considered in the context of
product testing, i.e. purpose (3) above. SISO loops can always increase the
controllability and observability (C/O) of those dies where a certain type of
defect would have disconnected the SISO latches from the external world
without the presence of the serial path. However, for correctly designed and
defect-free dies, an increase in testability is not always present. In our view,
SISO loops should be evaluated according to how much they increase the C/O of
operational chips.

In RISC II, the data-path SISO loop offers no increase in C/O in a correctly
working chip, because normal instructions can also be used to copy values
between registers and DST or SRC. The control SISO loop does increase the C/O of
the opcode-decoder's output, because normal instructions cannot read that
output, nor are there any instructions that can load an arbitrary value into
these latches. However, we consider that increase in C/O to be of limited
usefulness. The bits in that latch control so many things throughout the CPU that,
even if only one of them is incorrect, chances are that most instructions will not
work at all. From that point of view, the observability of the latch is very high.
On the other hand, loading an arbitrary pattern into that latch (one that does
not correspond to any real instruction) is of very limited usefulness. It is
normally very difficult to devise new ways of getting a data-path to perform useful
transfers that do not already exist; this is especially true in RISC II, where the
timing information is hardwired in the timing gates.

For all the above reasons, the SISO loops in the RISC II CPU have not been
used in testing it, just as was the case with RISC I [FoVP82]. For these
reasons, we consider this style of design-for-testability to have limited usefulness in
the case of micro-architectures with readily accessible internal state. For
bus-oriented chips, with reduced C/O of their internal state, we suggest that an
alternative style of design-for-testability be considered before resorting to SISO
loops. When a latch is close to some bus, consider connecting it to that bus for
test purposes; this may not be more expensive than adding SISO capability.
Such a parallel connection will usually be faster and easier to use than a SISO
loop. Of course, if all latches are connected to a single bus, they will all become
inaccessible if that bus does not work; but, again, we are only interested in
testing for working chips. For chips that are dominated by random logic and
that have few or no busses, SISO loops may still be a good solution.

CHAPTER 6:

ADDITIONAL HARDWARE SUPPORT
FOR GENERAL-PURPOSE COMPUTATIONS.


Chapter  2  studied  the  nature  of  general-purpose  computations  as  expressed 
in  von  Neumann  languages.  It  was  seen  that  a  few  simple  operations  account  for 
most  of  the  execution  time  and  that  high  performance  depends  mostly  on 

•  exploitation  of  fine-grain  parallelism, 

•  fast  addressing  and  operand  accessing, 

•  fast  decision  making  and  branching,  and 

•  fast  floating-point  operations  (in  numeric  applications). 

This  suggests  that  it  is  more  effective  to  use  special  devices  that  provide  fast 
access  to  instructions  and  operands  than  to  use  precious  chip  area  for  the 
implementation  of  complex  instructions. 

The  Berkeley  Reduced-Instruction-Set-Computer  experiment,  which  was 
presented  in  chapters  3,  4,  and  5,  has  investigated  this  direction  in  computer 
architecture.  RISC  I  and  II  provide  pipelined  execution  of  simple  instructions  in 
an  environment  where  local  variables  are  readily  accessible.  Section  3.4  gave  an 
evaluation of the experiment, showing both the viability and the advantages of
simple instruction sets.

Any hardware resources that remain available after a pipelined data-path
and its simple controller have been implemented should be spent as effectively
as possible for increasing performance. According to chapter 2, this means
providing fast access to the most frequently used operands, fast
compare-and-branch operations, and fast number crunching for numeric applications. The
first two of these issues are considered in this chapter. The last one,
number-crunching, is not considered in this dissertation.

Enhancements intended to provide the above support should be included
in a processor according to a "priority list" that depends on the hardware
resources (e.g. silicon area) available at a given time. We believe that these
priorities are as follows:

1.  Register  File  for  frequently  used  scalar  variables, 

2.  Instruction  Cache  with  support  for  fast  decision  making, 

3.  Data  Cache  for  non-scalar  operands. 

This  ordering  results  from  their  relative  cost  and  pay-off.  A  register  file,  even  of 
modest  size,  for  scalar  variables  allows  high  performance  gains.  An  instruction 
cache  is  larger  in  size  but  is  essential  in  feeding  a  fast  data-path  with  new 
instructions.  A  data  cache  has  less  of  a  visible  effect  in  an  architecture  where 
many  of  the  operands,  namely  the  scalar  variables,  are  already  in  registers. 
Separate  instruction  and  data  caches  are  proposed  here  for  two  reasons.  First, 
independent  memory  ports  are  desirable  for  parallel  instruction-fetching  and 
data-accessing  (§  3.3.2,  3.3.3);  second,  each  one  of  these  two  cache  types  can  be 
structured  in  a  different  way  to  take  best  advantage  of  the  peculiarities  of  its 
usage.  Such  organizations  are  proposed  below  in  §  6.3  and  §  6.4,  after  the  issue 
of  fast  access  to  scalar  variables  has  been  discussed  in  §  6.1  and  §  6.2.  Section 
6.5 deals with another important issue: moving data into and out of a processor's
main memory.


6.1 Multi-Window Register Files versus Cache Memories
for Scalar Variables.

Section  3.2  presented  the  organization  of  the  RISC  multi-window  register 
file  and  explained  how  it  provides  fast  access  to  the  frequently  used  local  scalar 
variables.  That  register  file  acts  as  a  small  and  fast  buffer  for  the  top  of  the 
stack  of  procedure  activation  records,  that  is,  for  the  most  recently  used  local 
variables.  In  that  respect,  a  multi-window  register  file  is  similar  to  a  cache 
memory.  This  section  investigates  the  similarities  and  the  differences  between 
them. 

6.1.1 The Various Kinds of Locality of Reference.

The locality of the memory references made by a program has usually been
studied by statistical methods in a "black box" approach, without looking at the
underlying program properties which cause it (see for example [Smit82]).
However, a study of the way programs access memory, like the study in chapter 2,
shows that this locality has interesting properties arising from the nature of
computations. Memory references can be divided into three categories,
each with different locality properties:

• Instructions: Instruction-fetches are read-only accesses. They are
sequential in small blocks, between if or call or loop statements.
Locality arises from the repeated accesses to instructions inside loops. Since
programs spend most of their time in small inner loops, this locality is
high.

• Scalar Variables: Scalars occupy a single memory location, and hence,
that fixed location is accessed whenever the variable is used. As noted
throughout chapter 2, some scalar variables are heavily used during
execution. These tend to be few in number, declared locally in their
procedure, and used as array-indexes, counters, pointers, flags, or
temporary storage locations. For example, in the critical loop of fgrep (fig.
2.4.1), out of the 29 operand accesses per iteration, 18 are made to 4
local scalars, 2 are made to 1 global scalar, and 9 are made to
non-scalars. These facts show the high locality of the references to the few
scalars in critical loops. They are not unrelated to the way the human
mind works by hierarchically breaking big tasks into smaller ones, and by
only dealing with a few objects/concepts at each level.

• Non-Scalar Variables: Arrays and structures occupy many memory
locations each, and accesses are not made to the same location each time.
Usually, certain elements of a few non-scalar variables are accessed once
or a few times, and then accesses shift to "neighboring" elements of those
variables (chapter 2). "Neighborhood" here may or may not be actual
proximity in virtual address space, depending on the type of the accessed
data structure.
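
The fgrep figures quoted in the list above can be turned into shares of operand traffic. The short calculation below uses only the counts given in the text; the variable names are chosen here for illustration.

```python
# Operand accesses per iteration of the fgrep critical loop (counts from the text).
accesses = {"local scalars": 18, "global scalars": 2, "non-scalars": 9}

total = sum(accesses.values())                       # 29 accesses per iteration
shares = {kind: n / total for kind, n in accesses.items()}
scalar_share = shares["local scalars"] + shares["global scalars"]

for kind, share in shares.items():
    print(f"{kind:15s} {share:.0%}")
print(f"{'all scalars':15s} {scalar_share:.0%}")
```

Under these counts, roughly two thirds of the operand accesses in the loop go to only five scalar variables.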


Cache memories usually treat all memory references alike and base their
operation on the average statistical observations. Some computers have dedicated
cache(s) for instructions and/or data. The rest of this chapter (except for § 6.5)
deals with alternatives to this organization.


6.1.2  Comparison  of  Registers  and  Caches  for  Scalars. 

The organization of the circular buffer of N register windows in the RISC
architecture (§ 3.2) is such that the N-1 most recent procedure activation
records are kept in that register file (except for the parts of activation records
which do not fit in one register window). A cache memory of size U blocks, on
the other hand, is organized so that the U most recently used memory blocks
are kept in it. Since, in those organizations, procedure activation records are
kept in a LIFO memory stack, it follows that those parts of the most recent
activation records that have actually been used in the recent past are kept in
the cache memory. Thus, a multi-window register file approximately holds a
subset of what a cache memory holds, and both of those devices are fast
memories intended to provide quick access to their contents. The similarities
and the differences between the two approaches will be further investigated here.
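
The behavior of the circular buffer of windows can be sketched with a small simulation. This is an illustrative model only: it assumes that each activation record occupies exactly one window, it ignores the overlap registers, and the function name count_window_traps and the call/return traces are hypothetical.

```python
def count_window_traps(trace, num_windows):
    """Count window overflows and underflows for a call/return trace.

    trace is a sequence of +1 (call) and -1 (return) steps.  With N windows,
    the N-1 most recent activation records are assumed to be resident.
    """
    capacity = num_windows - 1
    depth = 0        # call depth of the running procedure
    lowest = 0       # call depth of the oldest record still in the register file
    overflows = underflows = 0
    for step in trace:
        depth += step
        if depth - lowest + 1 > capacity:   # one record too many: save the oldest
            lowest += 1
            overflows += 1
        elif depth < lowest:                # returned into a saved record: restore it
            lowest -= 1
            underflows += 1
    return overflows, underflows


deep = [+1] * 10 + [-1] * 10          # ten nested calls, then ten returns
shallow = [+1, -1] * 100              # calls that return immediately
print(count_window_traps(deep, 8), count_window_traps(shallow, 8))
```

In this model the register file only traps when the call depth drifts by more than N-1 levels; a program whose depth oscillates within a narrow band never touches memory for its local scalars.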

A  register  file  will  hold  all  the  local  scalars  of  a  procedure,  or  a  random 
subset  of  them  in  the  rare  case  where  there  are  more  than  can  fit  into  a  window. 
A  cache,  on  the  other  hand,  will  only  hold  those  local  scalars  which  have  actually 
been used in the recent past. In this respect, the cache memory is better, since
it adapts itself dynamically rather than statically to the demands of the
computation. Two negative effects, however, make this adaptability worse than what it
might be theoretically. First, whole blocks containing the most frequently used
scalars are kept in the cache, not just the words themselves. This increases the
probability that the precious on-chip memory locations hold unused data.
Second, most caches are set-associative with a small set-size. In that case,
other data (or instructions) may overwrite some of the recently used local
scalars. The difference in the adaptability of register files and caches is also
reduced by the fact that most procedures have a few local scalars and use them
heavily (§ 2.2.2), so that the static prediction is not far from the dynamic
situation. On the other hand, fixed-size window schemes waste part of the window
when the activation record is small (see next section).

At  this  point,  it  is  worth  discussing  the  issue  of  global  scalars,  as  well. 
There  are  usually  many  global  scalars,  but  only  a  few  of  them  are  heavily  used. 
For  example,  fgrep  (§  2.4.1)  has  20  global  scalars  declared,  but  only  one  of 
them  is  used  in  the  critical  loop.  Cache  memories  will  dynamically  discover 
those variables and hold them. For a compiler, on the other hand, it is very
difficult, if not impossible, to determine which ones are frequently used and
to allocate them in registers. A viable approach requires the programmer to
give hints to the compiler, using register-type declarations like the ones that C
allows for local scalars. The same declarations would be useful for those few
procedures which have more local scalars than a window can hold.

While  the  above  comparisons  did  not  show  a  decisive  difference  between  a 
cache and a register file, these differ strongly when addressing overhead is
considered. Caches offer addressing transparency at the machine-language level, at
the  expense  of  always  referencing  objects  by  their  full,  long  identifier  (address). 
Registers  offer  no  such  transparency  at  that  level,  but  they  allow  referencing  of 
objects  by  short  identifiers.  Addressing  transparency  for  registers  is  offered  in 
the  HLL  domain,  which  is  all  that  matters  for  the  programmer.  The  effect  of  the 
identifier  size  is  very  important,  both  in  terms  of  accessing  delay,  and  in  terms 
of  hardware  requirements. 


Figure 6.1.1: Referencing a local scalar.

Figure 6.1.1 illustrates these points. To reference a local scalar in the RISC
II multi-window register file, 8 bits of information are decoded in a specialized
decoder, and one out of 138 registers is activated and places its contents onto
the corresponding bus. To reference a local scalar on the execution stack, the
stack-pointer SP must be selected and gated into an adder, where the short
offset constant out of the instruction must be added to it. This is a long
addition which produces a full-length memory address. The cache uses the LS-part
of that address to access a wide RAM, in order to read a number of words and
tags equal to the set size. The MS-part of the address is compared to the tags,
and one of the words that were read is selected.
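
The two access paths can be contrasted in a few lines of code. The sketch below is a simplified model, not a description of any particular machine: the cache geometry (64 sets, 2 ways, one word per block), the 32-bit address width, the replacement rule, and all function names are assumptions made for illustration.

```python
WORD_BYTES = 4
NUM_SETS = 64      # assumed geometry, not taken from the text
SET_SIZE = 2       # assumed 2-way set-associative organization


def new_cache():
    return [[] for _ in range(NUM_SETS)]


def split_address(addr):
    """Split a byte address into (set index, tag)."""
    word = addr // WORD_BYTES
    return word % NUM_SETS, word // NUM_SETS   # LS-part selects, MS-part is the tag


def cache_fill(cache, addr, value):
    """Place a word in its set, dropping the oldest entry if the set is full."""
    index, tag = split_address(addr)
    ways = [(t, w) for (t, w) in cache[index] if t != tag]
    ways.append((tag, value))
    cache[index] = ways[-SET_SIZE:]


def cache_read(cache, sp, offset):
    """Stack reference through a cache: long addition, RAM read, tag compare."""
    addr = (sp + offset) & 0xFFFFFFFF          # full-length address addition
    index, tag = split_address(addr)
    for stored_tag, word in cache[index]:      # all tags and words of the set are read
        if stored_tag == tag:
            return word
    return None                                # miss


def register_read(regfile, reg_number):
    """Register reference: a short number selects one register directly."""
    return regfile[reg_number]


cache = new_cache()
cache_fill(cache, 0x7FFFF008, 42)
print(cache_read(cache, 0x7FFFF000, 8), register_read(list(range(138)), 17))
```

The register path is a single indexing step, while the cache path needs an addition, a set read, and a comparison in series, which is the source of the delay discussed here.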

It is clear that the long address addition required to access a local scalar on
the stack makes such cache references slower than a register reference, even if
the cache itself is as fast as the register file. Pipelining can alleviate that,
but the gains are limited by the occurrence of jump instructions. In most
practical situations, the high cost of the hardware required to make cache
references fast enough cannot be afforded, and thus significant delays are
introduced. Examples of such situations are the following:


•  The  stack-pointer  may  be  located  in  a  general  register  file,  thus  requiring 
a  register  access  for  it  to  be  read. 

•  Dedicated  ALU’s  may  not  exist  for  computing  the  address  of  each  of  an 
instruction's  two  operands  and  for  performing  the  arithmetic  operations 
of  the  other  instruction(s)  in  the  pipeline. 

• For a cache to be effective, it has to be larger than a multi-window
register file (not counting tags), because it has to hold all kinds of data, not
just local scalars. This fact, combined with the tag comparison and
word-selection, which are in series with RAM-reading, will normally make the
cache access slower than a register access.

•  A  dual  operand  read-access  to  a  dual-ported  register  file  can  be  achieved 
with  a  register  width  equal  to  the  word  size.  A  2-way  set-associative 
cache,  which  is  normally  not  dual-ported,  requires  that  4  words  and  4  tags 
be  read.  Such  large  RAM  widths  may  not  be  affordable  for  an  on-chip 
cache. 

•  The  communications  bandwidth  requirements  are  higher  for  the  cache, 
due  to  the  full-width  memory  address.  This  is  an  important  bottleneck  for 
off-chip  caches. 


The combined effects of the above points can be seen in the following table. It
compares the speeds at which the PDP-11/70 and the VAX-11/780 can access
local integers in their registers and in their cache:


PDP-11/70 | VAX-11/780

The superiority of multi-window register files over cache memories, for
keeping  scalar  variables,  is  thus  clear.  There  are  two  fundamental  reasons  for 
that.  Firstly,  scalars  differ  from  data-structures  in  both  their  properties  and 
their  usage.  Scalars  are  few,  they  are  referenced  repeatedly,  and  they  are  used 
to  name  data-structure  elements  (e.g.  pointers,  array-indexes).  Secondly,  when 
the  sources  of  computations  can  conveniently  be  segregated  into  different 
storage  devices,  parallel  access  to  them  becomes  possible.  Such  is  the  case 
with  instructions,  scalars,  and  data-structures. 


6.2 Fixed-Size, Variable-Size, and Dribble-Back
Multi-Window Register Files.


The RISC I and II register file organization has fixed-size windows, and copies
these windows completely to/from memory when overflows or underflows occur
(§ 3.2). That is not the only possible organization for a multi-window register
file. Two alternative schemes are discussed in this section: register files with
variable-size windows and "dribble-back" register files that save and restore
windows "in the background" in parallel with normal instruction execution.


6.2.1 Variable-Size-Window Register Files.


Allocating procedure arguments and local scalars into registers is possible,
in the RISC window scheme, because the number of those scalars is quite small
most of the time, and they can thus all fit into the current window. The
measurements in [HaKe80] (§ 2.2.2) showed that the number of these arguments and
locals is smaller than 13 in more than 95% of the executed procedure calls. The
RISC II register file has enough space for 15 arguments and local scalars in each
of its fixed-size windows (the 16th register is used for the return-PC).

While  few  of  the  procedures  would  need  more  registers  than  a  window  has, 
many  of  the  procedures  use  only  a  few  of  the  available  registers  in  a  window. 
According to the dynamic measurements in [HaKe80], a procedure activation
record needs 4.6 ± 1.3 registers on the average for its arguments and locals.
According to similar, but static, measurements in [DiML82], that number is 5.7
words per procedure on the average †. What this means is that a large portion of
a  fixed-size  window  remains  unexploited  most  of  the  time,  if  that  window  is  large 
enough  for  most  of  the  activation  records  to  fit  in  it.  According  to  the  above 
numbers,  two  thirds  of  the  RISC  II  registers  remain  unutilized,  on  the  average. 
Thus,  sizable  silicon  resources  are  wasted;  this  is  a  serious  drawback  of  the 
fixed-size  window  scheme. 
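
The "two thirds" estimate follows directly from the averages quoted above. The calculation below is a sketch that takes the 15 usable registers per window and the two measured averages from the text; the function name is hypothetical.

```python
WINDOW_SLOTS = 15   # registers per RISC II window usable for arguments and locals


def unused_fraction(avg_registers_needed, slots=WINDOW_SLOTS):
    """Average fraction of a fixed-size window that holds no useful data."""
    return 1.0 - avg_registers_needed / slots


dynamic_unused = unused_fraction(4.6)   # dynamic average from [HaKe80]
static_unused = unused_fraction(5.7)    # static average from [DiML82]
print(f"{dynamic_unused:.0%} {static_unused:.0%}")
```

Both averages leave between about 60% and 70% of each window unused, which is consistent with the two-thirds figure.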


† Both measurements include all locals, scalars and non-scalars. However, the averages
given here only take into consideration those procedure activations which required ≤ 24
words for their arguments and locals, because it may safely be assumed that larger
requirements arise only out of local non-scalars. The first number is the average of the 9 dynamic
averages for the 9 measured programs. The second number is the average over the 1400
statically defined procedures in all of the standard UNIX commands (/usr/src/cmd/*.c).

Figure 6.2.1: Register Decoders for Multi-Window Schemes
(concept and implementation).


The alternative is to use a register file with variable-size windows in which
each window is only as large as is needed. Overall, such a register file needs
fewer registers, because of the improved utilization. This has several desirable
effects:

•  transistors  are  freed  and  can  be  used  for  other  functions  on  the  chip, 

•  the  register  file  is  faster,  due  to  its  smaller  size  and  correspondingly 
smaller  parasitics  (see  [Sher83]),  and 

•  saving  and  restoring  registers  into/from  memory  is  faster,  since  no 
unused  registers  are  copied. 

Of course, a maximum window size is always imposed by the total register-file
size and by the number of bits available in the instruction format for specifying
a register-number. However, in this variable-window-size scheme, whenever a
child procedure is called, the current window pointer only moves from its
previous position by as many registers as the parent procedure actually uses, instead
of moving by a fixed predefined distance. In this scheme, which is closer to the
traditional stack of activation records, windows may "begin" (be aligned) on any
arbitrary register in the register file. That means that the Current Window
Pointer (CWP) must now have single-register resolution in pointing to the
beginning of the current window. The register addressing process can no longer be
done with a simple AND-OR decoder; an addition must be performed instead.
Figure 6.2.1 illustrates these points. The required 6- or 7-bit adder in series with
the register decoder will slow down the decoding of register-numbers, which is
part of the critical path of the execution phase (fig. 4.2.1). Nevertheless, the
delay of a carefully designed small adder need not be much longer than the
extra delay caused by the OR-AND-INVERT gates required for decoding the
overlap registers in the fixed-size scheme (§ 4.2.3). This, coupled with the smaller
and faster register file, may make the variable-size scheme quite attractive.
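
The difference between the two decoding styles can be sketched as follows. The model is deliberately simplified and is not the RISC II decoder: it assumes a 64-register file, 16-register fixed windows without overlap or global registers, and hypothetical function names.

```python
NUM_REGS = 64        # assumed register-file size
WINDOW_BITS = 4      # fixed windows of 2**4 = 16 registers (assumption)


def fixed_decode(window, r):
    """Fixed-size windows: the window number and the register number are
    simply concatenated, so no carry has to propagate."""
    return ((window << WINDOW_BITS) | r) % NUM_REGS


def variable_decode(cwp, r):
    """Variable-size windows: CWP has single-register resolution, so a small
    addition (modulo the file size) is needed for every register access."""
    return (cwp + r) % NUM_REGS


print(fixed_decode(2, 3))        # register 3 of window 2
print(variable_decode(61, 5))    # window starting at register 61 wraps around
```

With NUM_REGS = 64 the addition in variable_decode is 6 bits wide, which corresponds to the small adder placed in series with the register decoder.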

Another  penalty  that  must  be  paid  in  the  variable-size  window  scheme  is  the 
additional  overhead  per  call-return  pair  for  updating  the  CWP  and  checking  for 
overflows  or  underflows.  These  tasks  cannot  be  carried  out  by  hardwired 
decrement/increment  and  compare  operations  as  in  RISC  II.  Instead,  the 
number to be subtracted or added to CWP, and the distance of CWP from SWP
to be checked for over/under-flow detection, depend on the number of
arguments and locals of the parent or child procedures. These pairs of procedures
may  be  separately  compiled,  in  languages  like  C,  and  thus  their  respective 
requirements  are  not  known  at  compilation  time.  Either  the  linkage  editor 
should  be  used  to  patch  the  call/return  statements,  or  an  additional  instruction 
per  call-return  pair  is  required.  Figure  6.2.2  shows  two  of  the  available  options 
for  updating  and  checking  CWP  in  the  variable-size  scheme. 

Figure 6.2.2: Update-&-Check Options
with Variable-Size Scheme.

• In (a), CWP points to the base (the "beginning") of the current window.
When a parent calls a child, CWP is changed by the size of the parent's
frame, and availability is checked for a maximum-size new window. These
can both be performed by the call instruction. When the child returns,
CWP has to be changed by the size of the parent's frame again. However,
that size is not known to the child procedure which executes the return
instruction, unless the linkage editor patches the code. In the absence of
such patching, an extra instruction has to be inserted after each call, to
"catch" the return, to update CWP, and to check whether the parent's
registers are still present in the register file. Alternatively, the parent's
CWP value may be passed to the child along with the return-PC.

• Part (b) of figure 6.2.2 shows the second option, in which the CWP points
to the "limit" of the current window. In that case, CWP has to be changed
by the size of the child's frame upon calls and returns. When the return
instruction is executed, that size is known, and no problem exists (the
check for validity of the parent's registers is made for the maximum
possible size of the parent's frame). However, the call instruction is
executed by the parent, and thus an extra instruction has to be inserted at
the entrance of the child procedure. This scheme inserts statically fewer
extra instructions (dynamically both schemes execute the same number
of extra instructions), but when the child checks for free space it must
check for the sum of its own frame plus the maximum number of outgoing
arguments of all of its calls. With this scheme, passing the parent's CWP
along with the return-PC does not work.

•  A  third  option,  used  in  Ditzel's  C  Machine  Register-Stack  [DiML82],  is  to 
insert  extra  instructions  both  at  the  entry-point  of  every  procedure  and 
at  the  target  of  every  return  instruction.  In  this  way,  an  accurate  check 
for  over/under-flow  is  possible  on  both  call's  and  return's. 
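
Option (a) above can be sketched as a small software model. This is an illustration under stated assumptions rather than a description of any implemented hardware: the class VariableWindowFile, its fields, and the file and window sizes are hypothetical, whole frames are saved and restored at a time, and the list of frame sizes stands in for whatever mechanism (linkage-editor patch, extra instruction, or a CWP value passed with the return-PC) tells the return how far to move CWP.

```python
class VariableWindowFile:
    """Model of update-and-check option (a): CWP points to the window base."""

    def __init__(self, num_regs=32, max_window=16):
        self.num_regs = num_regs
        self.max_window = max_window
        self.cwp = 0             # base of the current window (unwrapped count)
        self.swp = 0             # base of the oldest frame still in the file
        self.frames = []         # sizes of the caller frames below the current one
        self.first_resident = 0  # index in frames of the oldest resident frame
        self.spills = 0
        self.restores = 0

    def call(self, parent_size):
        """Move CWP past the parent's frame; make room for a maximum-size window."""
        assert 0 < parent_size <= self.max_window <= self.num_regs
        self.frames.append(parent_size)
        self.cwp += parent_size
        while self.cwp + self.max_window - self.swp > self.num_regs:
            self.swp += self.frames[self.first_resident]   # overflow: save oldest frame
            self.first_resident += 1
            self.spills += 1

    def ret(self):
        """Move CWP back by the parent's frame size; restore it if it was saved."""
        parent_size = self.frames.pop()
        self.cwp -= parent_size
        if self.cwp < self.swp:                            # underflow
            self.swp = self.cwp
            self.first_resident -= 1
            self.restores += 1


rf = VariableWindowFile()
for size in (5, 7, 6, 4):      # frame sizes of four nested callers
    rf.call(size)
print(rf.cwp, rf.swp, rf.spills)
```

The call side needs only the parent's own frame size, which the parent knows; the return side is where the extra information has to come from, as discussed in the list above.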

Ditzel's "Stack Cache Register Set" for the C Machine, described in
[DiML82], is similar to the variable-size window scheme, except that the CWP is
extended to be a full 32-bit memory address. In that way, registers are always
accessed with their equivalent memory address. To avoid the high penalty of
performing the long addition CWP+offset once per register access, that
addition is performed at the time the instruction is fetched into the instruction
cache. This scheme exploits the statistical fact that many of the non-recursive
procedures are called with the same CWP many or all of the times. When this is
not the case (with recursive procedures for example), the procedure code is
re-fetched into the cache upon the new procedure activation. The accessing of the
registers with memory addresses makes this scheme quite similar to a cache
memory (§ 6.1.2). Its fundamental difference from a cache is that it is only used
for the top of the execution stack. In this way, parallel access to it and to the
rest of the memory is possible, and it is managed as a single circular buffer with
no address tags, no LRU replacement, and no set-associativity.

6.2.2 Dribble-Back Register Files.

In the multi-window schemes that were examined up to now, saving and restoring
windows to/from memory was only done on overflows and underflows
respectively. These schemes are successful when enough windows exist, so that
overflows and underflows rarely occur. An alternative scheme is to perform the
saving and restoring before a need for it arises and to perform it "in the
background", that is, in parallel with normal instruction execution. As long as this
background copying does not slow down program execution, it doesn't matter
how frequently windows are saved or restored. Thus, it is possible to have very
few windows in a register file that is managed with this method. This kind of
management was proposed by Sites in [Site79], who used the name "dribble-back"
to describe it.

The advantage of a dribble-back register file is that it can be small in size,
and thus fast in operation ([Sher83]). Also, in the ideal case, it will never
overflow or underflow. Its disadvantage is the high memory bandwidth which it
requires for saving/restoring registers in parallel with normal instruction
execution. Thus, dribble-back register files are attractive for high-performance
systems, where the cost of the extra bandwidth may be affordable. To provide
sufficient bandwidth, one might use a pipelined cache that permits two accesses
per machine cycle: one for the executing instruction and one for the
saving/restoring process. Alternatively, separate instruction and data caches
may be utilized. In processors employing register windows, the data-memory
port is often left idle. In the measurements reported in § 3.2.2, only about one
fifth of all executed RISC instructions were load or store. The idle memory
cycles can be used in the background for saving or restoring registers.

Figure 6.2.3 shows a dribble-back register file with two windows, the
minimum possible size.

Part (a) shows a data-movement organization selected for ease of understanding
the operation of the scheme. Upon execution of a call instruction, the local
and input-argument registers of the parent procedure are copied (saved)
into a back-up set of registers (window). Simultaneously, the output-argument
registers are copied into the input-argument ones. Now, the
child procedure can start executing, with its input arguments in the input-registers,
and with the local and output registers free to be used. In parallel
with the child's execution, the back-up window is copied (saved) to
memory, in preparation for the event of another call occurrence. If that
other call does not occur, then upon execution of the return instruction by
the child, the input-registers are copied into the output-registers, thus
returning values to the parent, and the back-up set of registers is copied
into the local and input registers, thus restoring the parent's activation
record. Now, in parallel with the parent procedure continuing execution,
the back-up window must be prepared for the event of another return
instruction; the locals and input-arguments of the grand-parent must be
copied (restored) from memory into those back-up registers.

Figure 6.2.3: Dribble-Back Register File.


Thus, the following management schedule is followed: After a call instruction,
prepare for a new call by saving the parent's frame. After a return instruction,
prepare for a new return by restoring the parent's frame.


Part (b) of figure 6.2.3 shows an organization that is preferable for
implementation. Here, pointers are changed, instead of moving data from one window
into another. In this organization each register cell only needs connections to
the CPU bus(es) and to the background-port bus. In organization (a), register
cells need connection to one less bus, but they need additional shift-type
connections to their neighbours. These shift-type connections are more expensive
than normal bus connections in terms of silicon area; thus register-file (b) is
more compact than file (a). Another disadvantage of organization (a) is that all
registers are copied on calls and returns, thus requiring extra power for data
transfer. For this reason, organization (b) was also preferred for the RISC
register file.

The main advantage of dribble-back register files is their smaller physical
size, resulting from the low number of windows required. Successful
performance of the minimum-size dribble-back register file is critically dependent on
whether enough time is usually available between two successive procedure calls
for a window to be saved in memory, and between two successive procedure
returns for a window to be restored from memory. To evaluate this, the profiled
code of the sed and mextra programs (see § 2.4.2 and § 2.4.4) was analyzed by
hand, yielding the following dynamic measurements:

• When procedures are called and start executing:

    • ≈ 1/2 of them call no further children;

    • ≈ 1/3 of them call another procedure after executing 0 to 4 HLL
      statements (that is, 0 to 10 machine instructions); and

    • the remaining ≈ 1/6 of them call another procedure after
      executing 6 to 10 HLL statements.

• After a child returns to its parent:

    • in 50 or 70 % of the cases, another procedure is called in a while;

    • in 35 or 10 % of the cases, the parent returns after executing 0 to 1
      HLL statements (≈ 0 to 3 machine instructions); and

    • in 15 or 20 % of the cases, the parent returns after executing 4 or
      more HLL statements.


These numbers mean that, with a 2-window dribble-back register file, roughly 30
% of the calls or returns will have to wait because the back-up window is not yet
ready, unless the background-copying memory port has a bandwidth of several
words per machine cycle. Thus, effective multi-window register files with very
few windows are not easily achieved with the dribble-back scheme.
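
The management schedule and the waiting condition discussed above can be sketched as a small trace-driven model. This is only an illustration under stated assumptions, not the procedure used for the hand analysis: the function name `count_stalls`, the trace format, and the parameter `copy_cycles` (the number of machine cycles the background port needs to save or restore one window) are all hypothetical. The sketch also assumes that only a call following a call, or a return following a return, can find the back-up window not ready.

```python
def count_stalls(trace, copy_cycles):
    """Count calls/returns that must wait for the back-up window.

    trace       -- list of (event, gap) pairs; event is "call" or "return",
                   gap is the number of machine cycles executed since the
                   previous call/return completed (hypothetical format).
    copy_cycles -- cycles the background port needs to save or restore
                   one window (assumed constant).
    Returns (number of stalled events, total stall cycles).
    """
    stalled = 0
    stall_cycles = 0
    previous = None
    for event, gap in trace:
        if event not in ("call", "return"):
            raise ValueError("unknown event: %r" % (event,))
        # After a call the back-up window is being saved; after a return
        # it is being restored.  Assumption: only a repeat of the same
        # event needs that background transfer to have finished.
        if event == previous and gap < copy_cycles:
            stalled += 1
            stall_cycles += copy_cycles - gap
        previous = event
    return stalled, stall_cycles


# Invented trace, for illustration only.
trace = [("call", 20), ("call", 3), ("return", 5), ("return", 1), ("call", 2)]
print(count_stalls(trace, copy_cycles=10))
```

With this invented trace and a 10-cycle copy time, the second call and the second return are the events that wait; a faster background port (smaller `copy_cycles`) removes the waits.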


6.3 Support for Fast Instruction Fetching and Sequencing.

The  process  of  fetching  instructions  performs  two  basic  functions: 

• supplying "fuel" (i.e. instructions) to the execution unit, in order for the
computation to proceed, and

• guiding the computation onto the proper path, according to decisions
dynamically made in the execution unit.

In  a  high-performance  processor,  where  simple  instructions  control  a  pipelined 
data-path,  it  is  important  for  both  of  these  functions  to  be  fast.  This  section 
investigates  hardware  and  architectural  support  for  achieving  that  goal. 

The  various  organizations  proposed  in  this  section  are  centered  around  the 
use  of  an  instruction  cache.  As  mentioned  in  the  introduction  of  this  chapter, 
an  instruction  cache  is  one  of  the  desirable  hardware  enhancements  for  a  high- 
performance  processor,  for  several  reasons: 


•  a  cache  for  instructions  is  an  effective  device,  because  it  exploits  the 
locality  of  references  arising  out  of  loops  in  programs; 

•  a  cache  that  is  dedicated  to  instructions  is  simpler  than  a  general  cache, 
because  it  is  read-only; 

•  an  instruction  cache  which  is  separate  from  the  data-memory  port  of  the 
CPU  is  desirable  for  allowing  parallel  instruction  and  data  accesses  (§ 
3.3.2); 

• an on-chip instruction cache utilizes the silicon area more effectively than
microcode ROM, because it dynamically adapts its contents to the
requirements of the executing program;

• an independent instruction cache lends itself to incorporation into an
instruction fetch-and-sequence unit (see below).

An alternative to an instruction cache is a single or multiple instruction buffer.
Such buffers are simpler than a general cache and rely on the usual small size of
critical loops. However, the effectiveness of buffer schemes is limited by the
fact that each iteration of a critical loop often consists of the execution of
several small non-contiguous blocks of instructions, rather than of a single
contiguous block that could fit in an instruction buffer. As an example, the small
critical loop of fgrep (§ 2.4.1), which consists of the execution of only 11 lines of
source code, actually extends over two pages of source program. Since about
one out of 4 to 6 executed instructions is a successful conditional branch or
call/return, the average size of blocks of contiguously executed instructions is
only about 4 to 6 instructions (see § 6.3.2, § 2.2.1).

6.3.1  Remote-PC  Instruction  Units. 

The instruction fetching process enjoys a significant degree of independence
from the computation process. That independence is the basis of a desirable
hardware partitioning into separate fetching and execution units, allowing
for both locality of information processing and parallelism of operation.

Figure 6.3.1 shows an organization that minimizes the communication
bandwidth between an instruction fetch-&-sequence unit and the unit which
executes instructions. The Program Counter (PC) is contained in the former unit,
hence the name "remote-PC" scheme. The fetch-&-sequence unit understands
and executes control-transfer instructions -- jumps, calls, and returns. For that
unit, two characteristics of the instruction sequencing mode are important:

• Conditional/Unconditional Sequencing: For conditional transfer
instructions, a 1-bit condition -- supplied by the execution-unit -- selects one of
two possible paths. For unconditional sequencing, a single path is
possible, whether it be a linear (no transfers), or a non-linear (unconditional
transfer) one.

• Static/Dynamic Transfer Target: The address of the instruction
to-be-executed-next is usually known statically at compile time and thus does
not depend on the execution-unit. Exceptions are the return
instructions and the infrequent "computed jumps" (e.g. for case statements).

Figure 6.3.1: Remote-PC Scheme.


The average communication bandwidth between a remote-PC instruction-unit
and the execution-unit is not much more than the minimum bandwidth needed
to merely supply the instructions to the execution unit. That is so because
addresses are transmitted infrequently between the two units. Control transfers
with a dynamic target are not very frequent (return instructions are less than
5% of all executed RISC instructions [PaSe81]). Also, the jump instructions do
not need to be sent to the execution unit. The bandwidth achieved in this way is
significantly lower than the one required in the conventional scheme, where the
PC is kept in the execution unit and has to be sent out of it for every single
instruction-fetch. This low bandwidth shows that information is maintained and
processed locally to a maximum degree; this makes such a partitioning
desirable.

In order for the remote-PC implementation to be successful, the value of
PC should be used as little as possible in the execution of instructions. In
particular, no PC-relative addressing mode should exist for data accesses (see §
3.1.2). An instruction-unit with remote-PC and with an instruction cache will be
the basis for the hardware enhancements proposed in the rest of this section.
The Instruction-Cache chip that was designed and built for RISC II [Patt83] does
include a remote-PC; however, the latter contains only an estimated value of the
real PC, and is used for predictive fetching of instructions. The RISC II CPU was
not designed for a remote-PC system, and thus it includes the PC in itself, and it
has PC-relative load and store instructions.


6.3.2  Jumps,  and  Delays  Introduced  by  them. 

An instruction fetch-&-sequence unit has to deal efficiently with
control-transfer instructions, because they occur very frequently. The following table
reviews some of the measurements presented in § 2.2.1:


[Table with columns Property, Measurement, Reference. Rows: branching
instructions; branch instruction; unconditional jumps, relative to all jumps;
HLL statements, statically (if, call); HLL statements, dynamically.
Reference: [AlWo75]. The measured values are missing from this copy.]

frequency of jumps is concerned, and illustrates the same point:


while(new != NIL && old != NIL)        /* NIL is 0 */
{   if(new->bb.l < old->bb.l) { /* infrequent */ }
    else {  if(n < old->bb.t)
            {   if(last == NIL) { /* rare */ }
                else { last->next = old; last = old; }
                old = oldList;
                if(old != NIL) oldList = old->next;
            }
            else { /* infrequent */ }
    }
    if(depth[last->layer] == 0) { /* 50% of the times: call a procedure */ }
    if((depth[last->layer] += last->dir) == 0) { /* again 50%: call */ }
    nextEnd = (nextEnd < last->bb.t ? nextEnd : last->bb.t);
}


Besides illustrating the high frequency of jump instructions, the above
program fragment also shows the intimate connection between jumps and test or
compare instructions. The usual pattern is that a number is compared to zero
(test), or two numbers are compared to each other (compare), and a
conditional jump is then executed, based on the outcome of the comparison.
Hennessy et al. studied how many of the conditional jumps require an explicit
comparison operation performed for their sake and how many of them can use the
result of some other instruction [Henn82, table 2.3]. They measured that less
than 2 % of the conditional jumps were able to use condition-codes set by an
instruction that was not executed solely for that purpose.

Fast control-transfer instructions are particularly important for high
processor performance, because they are so frequent and because they block the
fetch-execute pipeline. This importance becomes even more significant when
the combined time spent for branches and comparisons is considered. For that
reason, the rest of this section will focus on fast compare-and-branch
operations.

6.3.3 Fast Compare-and-Branch Scheme.

Figure 6.3.2(a) shows the compare-and-branch scheme followed in RISC I &
II. Given that in about half of the cases the optimizer is able to move something
useful into the cycle labeled "OTHER", we can say that this scheme takes about
2.5 cycles, on average, for a comparison and a branch.
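
One way to read the 2.5-cycle figure is the following worked sum. The breakdown is an interpretation, not something stated in the text: it assumes one cycle for the compare, one for the branch, and an "OTHER" cycle that is wasted whenever the optimizer cannot fill it, with p denoting the fraction of cases in which it can.

```latex
\[
T \;\approx\; \underbrace{1}_{\text{compare}}
        + \underbrace{1}_{\text{branch}}
        + \underbrace{(1-p)\cdot 1}_{\text{unfilled OTHER cycle}}
  \;=\; 2.5 \ \text{cycles}
  \qquad \text{for } p \approx 0.5 .
\]
```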

Figure 6.3.2: Branching Schemes.


There are two possibilities for improvement of this scheme. First, the
comparison and the decision whether to take the branch or not can be executed in
parallel with the computation of the possible branch target, PC+offset. This
would require two ALUs, but it would reduce the time for a compare-and-branch
to about 1.5 cycles. This scheme comes naturally when a separate instruction
fetch-and-sequence unit is used, like the one of § 6.3.1. Second, the branch
target address is known at compile time, and there is no reason (other than
code compactness) why it should be recomputed every time the branch is
executed. When an instruction cache is used, a solution exists that allows both the
code to be compact and the target address computation not to slow down the
instruction-fetching. It will be presented in subsection 6.3.4.

Figure 6.3.2(b) shows the proposed fast compare-&-branch scheme. It
makes use of both of the above improvements, and it allows single-cycle
compare-&-branch instructions. However, now there is not enough time for the
jump/no-jump decision to be made before the fetching of the target instruction
begins. Thus, a two-port instruction cache is required, that fetches
simultaneously both possible targets of the conditional branch.

6.3.4  Target  Address  Specification  for  Fast-Branching. 

Most branches have a target not very far from the branch itself. Some
measurements for a particular language and architecture have shown 55% of the
branches targeted within a distance of 128 bytes, or 93% of them targeted within
a 16 Kbytes range (see § 2.2.1 [AlWo75]). This locality property is the basis of
the familiar PC-relative branch instruction, which achieves high code density by
specifying the target's distance from the branch. Figure 6.3.3(a) illustrates this
method. In that figure, thick lines represent information statically determined
(by the compiler) and thus included in the instruction. Thin lines represent
dynamically computed information. In the traditional PC-relative branch
scheme, shown in (a), the instruction contains an (n+1)-bit offset field, which is
added to the PC at execution time and produces the conditional target address.
All branches to within a distance of ±2^n from the current instruction can be
represented with this instruction format (see the little graph on the right).

The rest of figure 6.3.3 shows three variants of the proposed alternative
fast-branching scheme. The key idea here is that the instruction contains the n
least-significant (LS) bits of the conditional target address itself, rather than of
its offset. In this way, the instruction cache can start fetching that target as
soon as the branch instruction becomes known to it, without having to wait for
the result of an n-bit addition. This approach assumes that the block address of
the cache is not wider than n bits. The most-significant (MS) part of the
conditional target address still has to be computed at execution time, assuming a
compact branch instruction that cannot contain the whole address. However,
this computation can be performed in parallel with the cache RAM access, as
long as its result becomes available in time for the address tag comparison.
There are three different ways in which the MS part of the conditional target
address can be computed in the fast-branch scheme, and they are illustrated in
parts (b), (c), and (d) of fig. 6.3.3.

Figure 6.3.3: Avoiding the LS-part of the addition to determine the conditional
target of a PC-relative branch.

The straightforward transformation of the traditional scheme (a) into the
fast-branch scheme is shown in (b). The sign-bit of the offset and the carry-bit
from the (virtual) n-LS-bit addition are computed by the compiler and included
in the instruction, so that the MS part of the same addition can be recreated.
This scheme requires an (n+2)-bit instruction-field for specifying the conditional
target, and it achieves a worst-case branching range of ±2^n.

The second variant shown in (c) is better than that of (b). The compiler
supplies again 2 bits of information for computing the MS part of the conditional
target address. However, instead of adding both of them to bit <n> of PC as in
(b), they are considered as forming a 2-bit signed number which is added to the
MS part of PC. This achieves a worst-case branching range of −2^(n+1) to +2^n. The
scheme in (d) is similar to that in (c). It uses the PC<n-1> bit as a round-off
bit to achieve an equally balanced worst-case branching range of ±1.5×2^n.
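
Variants (c) and (d) can be sketched in a few lines. This is a minimal sketch under assumptions, not a transcription of figure 6.3.3: the function names, the use of plain integers as instruction addresses, and the exact way the round-off bit PC<n-1> enters the sum in (d) (here it is simply added to the MS part of PC) are illustrative choices.

```python
def encode_target(pc, target, n, round_off=False):
    """Compiler side: split `target` into its n LS bits and a 2-bit signed
    correction for the MS part of PC (variant (c); variant (d) when
    round_off is True).  Raises ValueError if the target is out of range."""
    base = pc >> n                       # MS part of PC
    if round_off:
        base += (pc >> (n - 1)) & 1      # assumed use of PC<n-1> in (d)
    delta = (target >> n) - base
    if not -2 <= delta <= 1:             # must fit a 2-bit signed number
        raise ValueError("branch target out of range")
    return target & ((1 << n) - 1), delta


def decode_target(pc, ls_bits, delta, n, round_off=False):
    """Hardware side: the n LS bits are used as they are (no addition);
    only the MS part needs a short add, which can overlap the RAM access."""
    base = pc >> n
    if round_off:
        base += (pc >> (n - 1)) & 1
    return ((base + delta) << n) | ls_bits


n = 10
pc, target = 0x1234, 0x1234 + 1000
ls, delta = encode_target(pc, target, n)
assert decode_target(pc, ls, delta, n) == target
```

Under these assumptions, every target between PC − 2^(n+1) and PC + 2^n is encodable in variant (c), and every target within ±1.5×2^n in variant (d), whatever the position of PC inside its 2^n-aligned block.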

Figure 6.3.4 shows the block diagram of an instruction fetch-and-sequence
unit that incorporates a two-port instruction cache and the fast
compare-and-branch scheme. (This is the simple form of the I-unit; figure 6.3.6 shows the full
form.) The double register {PCplus1, Instruction Register} at the CPU interface
is loaded at the beginning of each execution cycle with the instruction to be
executed and with its incremented PC value. The incremented-PC value is used as
the address for one of the two instruction-cache ports, and causes the
subsequent instruction to be fetched. Simultaneously, the appropriate field of the
current instruction (assuming that this is a branch) is used for determining a
possible target-address, according to the scheme of fig. 6.3.3 (c) or (d). This
possible target-address is fed to the second port of the instruction cache, and a
possible target-instruction is fetched. At the end of the cycle, the execution
unit has decided whether this was a conditional branch, and whether it should be
taken or not. According to that decision, the multiplexors IRMUX and PCMUX
select the output of the first or second cache port and the output of the first or
second address incrementer, in order to load the instruction and the
incremented-PC registers. Notice that the instruction cache should not initiate
the miss-process before it is certain that the offending access is for an
instruction that will actually be executed. The next subsection looks at the timing of
compare-&-branch instructions in more detail.

Figure 6.3.4: I-fetch Unit with Fast-Branch Scheme (simple version).
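
The cycle-by-cycle behaviour described above can be sketched as follows. This is only an illustrative model: the function name `i_unit_cycle`, the tuple representation of instructions, and the opcode string "cbr" are hypothetical, the cache is a plain list indexed by instruction address, and cache misses are not modelled.

```python
def i_unit_cycle(icache, ir, pc_plus1, taken):
    """Model one cycle of the simple I-unit; return the next (IR, PCplus1).

    icache   -- list of instructions indexed by instruction address
                (stands in for the two-port cache; no misses modelled).
    ir       -- current instruction as (opcode, target); target is None
                unless opcode == "cbr" (compare-and-branch).
    pc_plus1 -- address of the sequential successor of `ir`.
    taken    -- branch decision delivered by the execution unit at the
                end of the cycle.
    """
    opcode, target = ir
    is_branch = opcode == "cbr"
    # Port 1 always fetches the sequential successor.
    seq_instr = icache[pc_plus1]
    # Port 2 fetches the possible target in the same cycle, before the
    # jump/no-jump decision is known.
    tgt_instr = icache[target] if is_branch else None
    # IRMUX and PCMUX select one (instruction, incremented address) pair.
    if is_branch and taken:
        return tgt_instr, target + 1
    return seq_instr, pc_plus1 + 1


# Invented four-instruction program: the branch at address 1 targets address 3.
program = [("add", None), ("cbr", 3), ("sub", None), ("or", None)]
state = (program[0], 1)
state = i_unit_cycle(program, *state, taken=False)   # add executes, cbr is loaded
state = i_unit_cycle(program, *state, taken=True)    # cbr is taken, "or" is loaded
print(state)
```

Both candidate successors are read in every cycle of a branch, and the decision only steers the two multiplexors; this is what makes a single-cycle compare-&-branch possible in the model.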

6.3.5 Form of Comparisons in the Fast-Branch Scheme.

Figure 6.3.5 is a graph of the timing-dependencies for the compare-&-branch
instruction. It is similar to the one in figure 4.2.1, and it assumes a
processor with a data-path similar to that of RISC II and with the fast-branch I-unit
of fig. 6.3.4. The points labeled A, B, and C on this graph correspond to the
similarly labeled points in fig. 6.3.4.


Figure  6.3.5:  Timing  Dependencies 
in  Proposed  Compare-&-Branch  Scheme. 


At point A, an instruction is ready to start executing. Assume that it is a
compare-&-branch instruction that has to perform a comparison in the
execution-unit (upper cycle), while the I-unit fetches its two possible successor
instructions (lower cycle). The n LS bits of the addresses of both candidate
successors are known at point A (see figures 6.3.4 and 6.3.3(c,d)). Thus, the cache
RAM access can begin immediately. (We assume that n bits are enough to
address the words and the blocks of the cache RAM.) The cache RAM access is
complete at point B, at which point the MS part of the conditional-target must
also have been computed. The cache tag comparison may begin at point B and
must be completed at point C when the next instruction is ready to be selected.
At the same point C, the two incrementers INC (fig. 6.3.4) must have valid
outputs; notice that the carry of the second INC propagates in parallel with the
carry of the target-address addition.

At the same point C, the execution-unit must have decided whether the
branch should be taken or not, so that the next instruction can be selected.
Assuming an instruction set and a data-path similar to those of RISC II, the
compare-&-branch instruction must decode two registers (A→D), must read
them from the register-file (D→E), and must compare them and decide (E→C).
This comparison may or may not require a subtraction. Figure 6.3.5 clearly
shows that it would be overly restrictive for the data-path to leave enough time
between points E and C for a full-width subtraction. The reason is that all other
instructions may allow almost a full cycle E→F for the ALU operation. (See fig.
4.2.1; however, here, we do not want the ALU to be on the I-fetch critical path.)
Thus, the possible forms of the comparison should be restricted, so that no
subtraction is required.

This means that comparisons to zero ("tests") can be allowed, as well as
comparisons of two arbitrary quantities for equality or inequality. None of those
requires a circuit with carry-propagation for its detection: they are resolved by
just looking at the most significant bit of the source or by using a wide
precharged NOR gate to check for equality to zero, possibly after computing a
bitwise exclusive-OR in the ALU. However, a magnitude comparison of two
arbitrary quantities should not be allowed in a compare-&-branch instruction, since
it requires an adder or a priority encoder for its implementation.
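
The carry-free character of the allowed comparisons can be illustrated with a short sketch. The condition names ("EQ", "LT0", and so on), the 32-bit word width, and the function names are assumptions made for the illustration; only bitwise operations, a sign-bit test, and an all-zero test are used, never a subtraction.

```python
WIDTH = 32                      # assumed word width
MASK = (1 << WIDTH) - 1
SIGN_BIT = 1 << (WIDTH - 1)


def is_zero(word):
    """Model of the wide NOR gate: true when every bit of the word is 0."""
    return (word & MASK) == 0


def is_negative(word):
    """Two's-complement sign test: looks only at the most significant bit."""
    return (word & SIGN_BIT) != 0


def branch_condition(cond, a, b=0):
    """Evaluate a restricted-form comparison without carry propagation."""
    if cond == "EQ":            # equality: bitwise XOR, then zero detect
        return is_zero(a ^ b)
    if cond == "NE":
        return not is_zero(a ^ b)
    if cond == "LT0":           # tests of `a` against zero
        return is_negative(a)
    if cond == "GE0":
        return not is_negative(a)
    if cond == "LE0":
        return is_negative(a) or is_zero(a)
    if cond == "GT0":
        return not (is_negative(a) or is_zero(a))
    raise ValueError("unsupported condition: %r" % (cond,))
```

A magnitude comparison such as a < b has no counterpart here, because deciding it needs the carry chain of a subtraction (or an equivalent priority encoder).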

This restricted comparison form is another example of an architectural
decision made for implementation reasons. To evaluate its effects in real
programs, the critical program fragments presented in sections 2.3 and 2.4 were
analyzed by hand. All comparisons for conditional branches were classified into four
categories:

• TST: arbitrary comparisons to zero;

• EQ/NE: comparison of two quantities for equality/inequality;

• CMP-SOFT: those comparisons used for deciding the termination of a
DO-loop, which are of the form if (i ≤ LIMIT), and which could be re-written in
the form if (i == LIMIT);

• CMP-HARD: all comparisons of two arbitrary quantities for >, <, ≥, or ≤
which are not CMP-SOFT.

They were counted dynamically; the average percentages in each type of
comparison are given below, separately for the 17 numeric critical loops (§
2.3), and for the 3 non-numeric programs (fgrep, sed, mextra: § 2.4):


                     TST         EQ/NE       CMP-SOFT    CMP-HARD
numeric programs     18 % ±28    18 % ±32    13 % ±23    51 % ±43
non-num. programs    15 % ±15    55 % ±11    24 % ±8      8 % ±8

These numbers show that, with no program re-writing, the restricted-form
comparisons would be useful in about 40% of the cases for numeric programs and in
about 80% of the cases for non-numeric ones. With program re-writing or with
language semantics that allow equality comparison in DO-loops, these numbers
would become about 80% and about 85%, respectively. Thus, the restricted
comparison form appears quite frequently in real programs, especially in
non-numeric ones.

From the point of view of the instruction format, the compare-&-branch
instruction can fit in 32 bits, although certainly not in a fashion compatible with
the RISC I & II instruction format. This incompatibility may result in
performance penalties due to a more complicated instruction decoding, if this scheme
is used within an ISP like that of the Berkeley RISC. A possible set of
instruction-fields and widths for that instruction is the following:

•  3-bit  opcode, 

• 4-bit branch-condition specifier and S2 selector,

• 5-bit Rs1 specifier,

• 8-bit S2: Rs2 or a byte-wide immediate,

•  12-bit  target-address  specifier  (n  =  10  in  fig.  6.3.3). 

The  12-bit  target-address  specifier  allows  the  use  of  10-bit  addresses  for  the 
cache  RAM.  This  corresponds  to  a  maximum  cache  size  of  2  K-instructions  for  a 
2-way  associative  cache,  or  4  K-instructions  for  a  4-way  associative  one,  which 
are  reasonable  limits. 
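
The field widths and the cache-size limits quoted above can be checked with a short sketch. The field names and, in particular, the order in which the fields are packed into the word are assumptions made for the illustration; the text only gives the widths.

```python
# (name, width in bits); the names and the packing order are hypothetical.
FIELDS = (("opcode", 3), ("cond", 4), ("rs1", 5), ("s2", 8), ("target", 12))

# The five fields exactly fill a 32-bit instruction word.
assert sum(width for _, width in FIELDS) == 32


def pack(**values):
    """Pack the named field values into one 32-bit word, opcode first."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError("%s does not fit in %d bits" % (name, width))
        word = (word << width) | value
    return word


def unpack(word):
    """Inverse of pack(): return a dict of field values."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


def max_cache_instructions(address_bits, ways):
    """Largest cache reachable with `address_bits` of RAM address per way."""
    return ways << address_bits


print(max_cache_instructions(10, 2), max_cache_instructions(10, 4))
```

With 10 address bits this gives 2048 and 4096 instructions for the 2-way and 4-way cases, matching the 2 K and 4 K limits mentioned in the text.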

6.3.6  Extension  for  Zero-Delay  Unconditional  Branches. 

Figure 6.3.6 shows the full version of an instruction fetch-and-sequence unit
with a two-port instruction cache. This version, besides implementing the fast
compare-&-branch scheme of § 6.3.3, also executes unconditional branches in
"zero-time", i.e. without holding the execution-unit while it follows those
branches. The exception is unconditional branches which follow conditional
ones within a distance of 1 or 2 instructions: they will hold the execution unit
for 1 cycle.

As we saw in § 6.3.2, about half of all branches are unconditional, thus
accounting for roughly 1/10 of all executed instructions. They arise in a natural
way when the non-linear flow-diagram of a program is converted into
machine-code stored in a linear memory. Unconditional branches describe no useful
computation; they just divert the instruction-fetching process onto a different
path. It is interesting to note that unconditional branches are usually "for free"
in micro-code since micro-instructions usually contain the address of their
successor inside themselves. The same general scheme can also be used with
macro instructions.

Figure 6.3.6: I-fetch Unit with Fast-Branch (full version).


A two-port instruction cache is used for simultaneously fetching both
possible targets of compare-&-branch instructions. The basic idea behind the full
version of the I-unit is to also exploit both ports of the instruction-cache when
instructions other than compare-&-branch are executing. When such an "other"
instruction is executing, the PCplus1 and PCplus2 registers are used to fetch its
next two instructions, say I_1 and I_2. If I_2 is an unconditional branch, and I_1
is neither an unconditional nor a conditional one, then I_1 can be supplied to the
E-unit for normal execution, and the target address of I_2 can be followed
immediately. While I_1 is being executed in the E-unit, the target of I_2 is
fetched, and the unconditional branch becomes invisible to the execution unit.

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Unconditional  branches  can  have  an  instruction-format  different  from  that 
of  conditional  ones.  This  makes  possible  the  immediate  pursuit  of  the  target 
address  without  the  need  for  techniques  similar  to  those  of  the  fast  compare-&- 
branch  scheme  (§  6.3.4).  Unconditional  branches  can  have  a  3-bit  opcode  and  a 
29-bit  absolute  target  address.  It  is  advantageous  to  make  the  target  of  all 
unconditional  branches  be  an  even-word  aligned  instruction.  In  that  way,  both 
the  target  instruction  and  the  instruction  next  to  the  target  can  be  fetched  by 
concatenating  the  29-bit  target-field  with  ”000"  and  with  "100"  (byte  addresses), 
respectively.  These  quantities  are  fed  to  PCplus  1  and  PCplus  2  as  soon  as  I$z  is 
detected  to  be  an  unconditional  branch.  This  format  for  unconditional  branches 
also  offers  a  means  of  branching  to  arbitrarily  distant  memory  locations,  unlike 
the  restricted-range  compare-Je-branch  instructions. 
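The formation of the two fetch addresses from the 29-bit target field can be sketched in C as follows. This is only an illustration: the text does not say where the opcode sits in the instruction word, so the sketch assumes it occupies the top 3 bits, and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout (not specified in the text): 3-bit opcode in the top bits,
   29-bit absolute target field in the low 29 bits of a 32-bit instruction. */
#define TARGET_FIELD_MASK 0x1FFFFFFFu

/* Byte address of the target instruction: target field followed by "000". */
uint32_t branch_target_addr(uint32_t instr)
{
    return (instr & TARGET_FIELD_MASK) << 3;
}

/* Byte address of the instruction next to the target: field followed by "100". */
uint32_t branch_target_next_addr(uint32_t instr)
{
    return ((instr & TARGET_FIELD_MASK) << 3) | 0x4u;
}
```

Because the target is even-word aligned, no adder is needed: both addresses are obtained by wiring the field to the upper address bits and appending constant low-order bits.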

Call instructions perform the same function as unconditional branches do,
except that they must also save the PC for use upon procedure return. If
the place to save the PC is a fixed register in the child's window, then call
instructions can have the same format of a 3-bit opcode and a 29-bit absolute
target address. The I-unit can treat calls similarly to unconditional branches,
except that it must also supply them to the E-unit, which must execute them by
saving the PC.

Normally, an unconditional branch should first appear as the second one of
the two consecutive instructions I_S1 and I_S2 being fetched from memory
locations M[PCplus1] and M[PCplus2], respectively. The reason is that, if execution
has been sequential in the recent past, then an unconditional branch fetched via
M[PCplus1] would also have been fetched via M[PCplus2] one cycle earlier and
would have been executed at that time. That will not happen if execution has
not been sequential in the recent past, and specifically if an unconditional branch
is the target of another one or if it follows a compare-&-branch within a distance
of 1 or 2 instructions. (Notice that occurrences of the former situation can be
removed by the optimizer or the linkage editor.) Because these cases have to be
dealt with, unconditional branches should also be detected and handled when
appearing at the first (S1) port of the I-cache. However, in those cases there is
nothing else useful that can be supplied to the E-unit while the target of the
branch is being fetched. Thus, the unconditional branch itself can be given to
the E-unit, which should interpret it as a no-op.
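The cases discussed above can be summarized as a small decision function. The sketch below is a simplified reading of the text, not the actual I-unit logic: the type and function names are hypothetical, s1 and s2 stand for the instructions arriving at the first and second cache ports, and call instructions are left out.

```c
/* Instruction classes seen by the I-unit (calls are omitted from this sketch). */
typedef enum { KIND_OTHER, KIND_COND_BRANCH, KIND_UNCOND_BRANCH } InstrKind;

typedef enum {
    ACT_ISSUE_S1_SEQUENTIAL, /* issue s1, keep fetching sequentially           */
    ACT_ISSUE_S1_FOLLOW_S2,  /* issue s1, fetch target of s2: zero-time branch */
    ACT_ISSUE_S1_AS_NOOP,    /* s1 is the branch: E-unit sees a no-op          */
    ACT_COND_BRANCH_SCHEME   /* s1 is handled by the fast compare-&-branch scheme */
} IUnitAction;

/* Decide what the I-unit does with the instruction pair (s1, s2). */
IUnitAction iunit_decide(InstrKind s1, InstrKind s2)
{
    if (s1 == KIND_UNCOND_BRANCH)
        return ACT_ISSUE_S1_AS_NOOP;     /* nothing useful to overlap with */
    if (s1 == KIND_COND_BRANCH)
        return ACT_COND_BRANCH_SCHEME;   /* both targets fetched in parallel */
    if (s2 == KIND_UNCOND_BRANCH)
        return ACT_ISSUE_S1_FOLLOW_S2;   /* branch is invisible to the E-unit */
    return ACT_ISSUE_S1_SEQUENTIAL;
}
```

Only the third case hides the branch completely; the first case costs the one cycle during which the E-unit executes the branch as a no-op.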

In this section, instruction fetch-and-sequence units of increasing
sophistication have been proposed. When sufficient hardware resources are available
for the implementation of such an I-unit, a high-performance execution-unit can be
kept busy and the time spent executing control-transfer instructions can be
reduced. Roughly 1.5 cycles per conditional branch and one cycle per
unconditional branch can be saved, amounting to about 2.5 cycles out of every 10
cycles of execution.


6.4  Pointers  and  Data  Caches. 

According to the list of proposed priorities at the beginning of this chapter,
hardware resources that remain available after a multi-window register file and
an instruction-unit have been implemented should be spent on a data cache.
The two former devices were investigated in the previous sections. In this
section, the special nature of accesses to non-scalar data is considered as far as
the construction of an effective data cache is concerned, and hardware as well as
programming methods for its exploitation are proposed.


6.4.1  Data-Structure  Accesses  and  Data  Caches. 


Operands used by programs are either scalar variables or elements of
non-scalar data-structures. These two categories differ fundamentally, as pointed
out in § 6.1.1. Scalars are few in number, they occupy little memory space, they
are referenced using their own name, and several of them are used repeatedly.
They are often used to refer to particular elements of data-structures.
Data-structures, on the other hand, have many elements and occupy large memory
space. Individual elements of these structures are accessed via dynamically
computed addresses.

Accesses  to  non-scalars  follow  certain  typical  patterns: 


A • A number of repeated accesses to the same element is often made before
interest shifts to another element of the data-structure. For example,
A[i,j] is accessed three times during each critical-loop iteration in fig.
2.3.1; the element last->layer is accessed ≈ 4 times per iteration in the
procedure ScanSubSwath() in § 2.4.4. Another, less frequent, case arises in
the critical loop of fgrep in fig. 2.4.1. The pointer c is pointing to the same
element of the structure words during most loop iterations, because the
scanner is searching for the same first letter of the desired pattern most of
the time.

B • Accesses to near-by memory locations are frequently made. They arise in
two different ways. First, arrays are frequently traversed in a sequential
manner such that each element accessed is next to the previously visited
one, in terms of its memory address. This is true for sequential scanning of
linear arrays (character buffer scanning is a common case), as well as for
the scanning of multi-dimensional arrays by columns (in FORTRAN). These
occur quite frequently (see § 2.3, 2.4.1, 2.4.2). Second, more than one of
the fields of a structure are usually accessed before program execution
moves to another structure (§ 2.4.4). Since structures are often small in
size, their fields are in near-by memory locations. To evaluate that size,
static measurements were collected (by hand) on the size of the structure
types declared in 15 C programs, including a screen editor (emacs), a HLL
interpreter (logo), fgrep, sed, and mexfro. Among 54 structure
declarations investigated, about 45% of them were found to have 4 or fewer words;
about 70% had 8 or fewer words; and about 85% had 16 or fewer words. All sizes
are for the VAX-11/780, where 1 word equals 4 bytes.

C • However, the occasional shift of accesses to remote locations cannot be
neglected. It occurs whenever array accesses are not sequential in address
space, or when various nodes of dynamically allocated data-structures are
accessed. The former case is not very frequent for linear arrays, but it is
common for multi-dimensional ones. The latter case occurs due to the
"random" allocation in memory of nodes that are linked to each other. In
practical situations, however, this allocation is not completely random, and
some linked nodes do end up next to each other. A study of 5 large LISP
programs by Clark and Green [ClGr77], for example, showed that about 1/4
of the car and cdr list pointers were pointing to the immediately adjacent
(forward direction) memory cell.


Registers  are  the  natural  device  for  holding  frequently  used  scalars,  because  of 
the  small  number  and  size  of  those  variables.  A  cache  memory,  on  the  other 
hand,  is  well  suited  for  keeping  the  elements  of  data  structures,  because  of 
access  patterns  A  and  B  above.  The  fact  that  caches  keep  the  most  recently 
used  words  in  them  accelerates  type-A  accesses,  while  the  fact  that  caches  fetch 
a  whole  block  when  a  word  is  missed  accelerates  type-B  accesses. 


Figure 6.4.1: One-Cycle Load Instructions in a RISC II with a Data-Cache.


A data-cache that is separate from the instruction-fetch port of the CPU
can allow one-cycle load and store instructions in the RISC II architecture and
pipeline. A timing scheme similar to that of fig. 3.3.4(b) is used for that
purpose, except that here no restriction needs to be placed on the RISC II
addressing mode Rd ← M[Rs1+S2]. Figure 6.4.1 shows the timing of a load instruction
when a data-cache is used. The data-cache access may begin just a little while
after the addressing addition has started in the ALU. The reason is that only the
least-significant bits of the effective address are required for the cache
RAM access. The rest of the bits of the effective address are needed later on
when the address tag comparison takes place. Internal forwarding allows the
next instruction I2 to use the loaded data. For a cache that requires the n LS
address bits for its RAM access, the timing constraint is:

    (Data Cache Access Time) < T - (n-bit Add Time).

Notice that for this scheme to be possible, the data-cache needs to be separate
from the instruction-fetch port of the CPU (e.g. the instruction-cache), so that
the required parallel access to both of them is possible.
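The split of the effective address into an early-needed index and a late-needed tag, together with the timing constraint, can be sketched as follows. This is a simplified illustration: it assumes that T denotes the machine cycle time, it ignores any byte-offset bits within a word, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* The n least-significant address bits select the cache RAM entry; they are
   available as soon as the low part of the addition is done. Valid for
   0 < n < 32. */
uint32_t cache_index(uint32_t effective_addr, unsigned n)
{
    return effective_addr & ((1u << n) - 1u);
}

/* The remaining high-order bits are only needed later, for the tag compare. */
uint32_t cache_tag(uint32_t effective_addr, unsigned n)
{
    return effective_addr >> n;
}

/* Timing constraint from the text, with T assumed to be the cycle time:
   (data cache access time) < T - (n-bit add time). */
bool one_cycle_load_feasible(double cache_access_time, double cycle_time,
                             double n_bit_add_time)
{
    return cache_access_time < cycle_time - n_bit_add_time;
}
```

A smaller n shortens the n-bit add time and so leaves more of the cycle for the cache RAM access, at the price of a smaller directly indexed cache.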

Cache misses limit overall performance. Accesses of type C cause misses,
and so do those of type B when they cross block boundaries. The next
subsection proposes hardware and programming methods for reducing the number of
those misses.


6.4.2  Pointers  as  Pre-Fetching  Hints. 

There is an intimate connection between pointers and data-structure
accesses. Most accesses to fields of structures are made indirectly through a
pointer scalar variable that is pointing to the structure, and most of those
pointer variables are local ones. Similarly, in non-numeric programs, array
accesses are often made using pointers to the array elements, such as
character-buffer accesses in text-processing programs. For example, 85 % of
the accesses to non-scalar data made in the critical loops in § 2.4 have the form
*p or p->fld, where p is a local scalar variable; the remaining 15 % of those
accesses have the forms arr[n] or (non-scalar)->fld †. In numeric programs,
array accesses are traditionally made using subscripts (in languages like
FORTRAN this is the only choice). These, however, could also be replaced by
indirections through local pointers by the optimizer or by the sophisticated
programmer, as noted in § 2.3.4. Such a replacement would simplify the required
address computations, and would also make forthcoming proposals applicable to
these accesses as well.

† These measurements are static; however, they were collected only from
critical loops (§ 2.4).

When a scalar pointer variable is loaded with a new value, it is very likely for
that value to be used shortly afterwards for an indirect access to an array
element or to a field of a structure. For example, after 90 % of the assignments to
a local pointer p made in the critical loops in § 2.4, the assigned value was used
for an access of the type *p or p->fld. These accesses are always made to the
memory word to which p is pointing (*p), or to neighbouring words (p->fld).
In the remaining 10 % of the cases, p was used for purposes other than indirect
memory accesses - for example it was compared to a limit value †.

This suggests that a data cache should use the assignment of a value to a
pointer variable as a hint for prefetching into the cache the block to which this
pointer is pointing, if it is not already there. In this way, some of the type-C
accesses (§ 6.4.1) may be turned into type-A or type-B ones, and the miss ratio
may be reduced. In a processor where registers are used to hold the most
frequently used scalar variables, the criterion for prefetching a block into the data
cache may simply be the writing of a pointer value into a register. One way of
distinguishing pointer from non-pointer assignments is by setting aside some
registers for pointer values only ("index registers"). This, however, reduces the
flexibility and orthogonality of the architecture and leads to non-optimal
utilization of the register file. A better way is to have the compiler tell the machine,
with one bit in the instruction, that the assigned value is a pointer one. The
proposed prefetching scheme is reminiscent of the way loads and stores are
performed on the CDC-6600 computer [Thor64]; a memory-to-data-register
transaction is implicitly initiated every time an address-register is loaded with a value.
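The hint mechanism can be illustrated with a toy model in C. Everything here is hypothetical: the direct-mapped organization, the sizes, and the names are chosen only to make the idea concrete, and the prefetch is modelled as instantaneous, whereas a real one takes several cycles.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 32u   /* hypothetical block size: 8 words of 4 bytes */
#define NUM_SLOTS   64u   /* hypothetical direct-mapped cache size       */

typedef struct {
    bool     valid[NUM_SLOTS];
    uint32_t block[NUM_SLOTS];  /* block number currently held in each slot */
    unsigned prefetches;        /* number of hint-triggered block fetches   */
} DataCache;

/* True if the block containing addr is present in the cache. */
bool cache_holds(const DataCache *c, uint32_t addr)
{
    uint32_t blk = addr / BLOCK_BYTES;
    uint32_t slot = blk % NUM_SLOTS;
    return c->valid[slot] && c->block[slot] == blk;
}

/* Bring the block containing addr into the cache (modelled as instantaneous). */
void cache_fill(DataCache *c, uint32_t addr)
{
    uint32_t blk = addr / BLOCK_BYTES;
    uint32_t slot = blk % NUM_SLOTS;
    c->valid[slot] = true;
    c->block[slot] = blk;
}

/* Register write; is_pointer models the one-bit compiler hint in the
   instruction. A pointer write prefetches the pointed-to block if absent. */
void write_register(DataCache *c, uint32_t *reg, uint32_t value, bool is_pointer)
{
    *reg = value;
    if (is_pointer && !cache_holds(c, value)) {
        cache_fill(c, value);
        c->prefetches++;
    }
}
```

In this model a later access through the register to *p, or to a nearby field in the same block, finds the block already present; a non-pointer write never disturbs the cache.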

The success of this prefetching scheme in actually reducing cache misses is
critically dependent on the amount of time available between the assignment of
a value to a pointer and the first use of that value for an indirect memory
access. That time interval has to be long enough for the corresponding block to
be prefetched into the cache before an access is first made into the block.
Depending on the system organization, the time to fetch a block into the cache
may vary in the range of about 4 to 10 machine cycles, or about 4 to 10
RISC-style instructions. Of course, even if that time is not available between the
assignment to a pointer and its first use, a gain still exists in the form of a
shorter miss delay. To get an estimate of the above time interval, we counted
the HLL statements in the critical loops in § 2.4 that are executed between
loading a pointer and first indirecting through it †. The results were as follows:

≈ 50% of the cases (8 pointer loadings): 0 or 1 HLL statements
≈ 20% of the cases (3 pointer loadings): 2 or 3 HLL statements
≈ 30% of the cases (4 pointer loadings): 4 or more HLL statements

This means that only in 30 to 50 % of the cases is there enough time for the
prefetch to be complete before the first indirection through the pointer actually
occurs.

It is possible for this time interval to be lengthened, and thus for the
prefetching-hint scheme to yield better results, through a more sophisticated
programming technique. In critical loops, the programmer can preload a local
pointer with an address to be (potentially) used during the next loop iteration.
For example:

    /* Run through a list, doing some processing with its elements */
    struct node *current, *next;       /* local pointers */

    next = head;
    while ((current = next) != NIL) {
        next = current->nxt;           /* preload */
        /* ... do the processing, using current->otherFields ... */
    }

In  a  few  cases,  a  code  rearrangement  with  similar  effects  could  be  made  by  an 
optimizing  compiler.  However,  most  cases  are  such  that  the  prefetching  will 
only  be  effective  when  done  almost  one  whole  loop  iteration  ahead  of  time.  This 
usually  requires  the  introduction  of  a  new  pointer  variable  by  the  programmer. 

Even with this sophisticated programming technique, misses will still occur
whenever a pointer to a structure is loaded with a value pointing into one cache
block and is subsequently used to access a field of the corresponding structure
which overflows onto the next cache block. This will happen most frequently
with pointers pointing near the end of blocks or with structures that are larger
than the block size. One solution can be to prefetch both the block pointed to
by p and the next block whenever p is loaded with a value that points "too
close" to a block's end. Another solution can be to try to allocate structures so
that they do not cross cache block boundaries too often. Assuming a block size
of 8 words = 32 bytes, this can be done trivially for structures of sizes 2, 4, or 8
words. In the static measurements reported in § 6.4.1 (B), the structure
declarations that had exactly those sizes constituted 10%, 25%, and 10% of all
structure declarations, respectively.
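Both remedies rest on a simple address test, sketched below for the 32-byte block size assumed in the text. The function names are hypothetical. The first function tells whether a structure overflows onto a following block (the case in which the next block should be prefetched too); the second rounds an allocation address up so that a structure of 8, 16, or 32 bytes, aligned to its own size, never crosses a block boundary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 32u   /* 8 words of 4 bytes */

/* True if the object [addr, addr + size) spans more than one cache block. */
bool crosses_block_boundary(uintptr_t addr, size_t size)
{
    if (size == 0)
        return false;
    return addr / BLOCK_BYTES != (addr + size - 1) / BLOCK_BYTES;
}

/* Round addr up to a multiple of align; align must be a power of two. */
uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    return (addr + align - 1) & ~(align - 1);
}
```

For example, a 16-byte structure placed at byte address 24 spans two blocks, whereas the same structure placed at align_up(24, 16) = 32 does not.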


6.5 Multi-Port Memory Organization.


Throughout this dissertation our focus has been the central processing unit
of a von Neumann computer. The predominant pattern of simple operations
applied to large volumes of operands was observed in typical examples of
frequently occurring computations. As a consequence, priority was given to
hardware support for fast operand accesses in the forms of multi-window
register files, instruction fetch-and-sequence units, and data caches. However, the
prevalent role of access to data or information is not confined to the interaction
of the CPU with its surrounding fast storage devices. Also important is the
access to bulk storage devices, to other processors, and to remote computing
sites. This section concerns itself with the access to information in main
memory. The use of memories with multiple ports will first be established.
Then, a new device is proposed, a modified dynamic RAM chip, which effectively
provides a second, independent, sequential-access port at a minor additional
cost.


6.5.1  The  Need  for  Multi-Port  Memory  Systems. 

Figure 6.5.1 presents an overview of various system organizations, showing
their requirements for multi-port memory systems.


• (a) is a uni-processor, perhaps with a single cache memory. Slow I/O
accesses (e.g. terminals, telephone lines) may be made via the CPU or via
memory, but fast I/O accesses (e.g. disks, local-area-networks (LAN),
raster displays) are always made by direct-memory-access (DMA) since their
bandwidth is so high that the CPU cannot handle it. At least two
high-bandwidth memory ports are seen to be needed: CPU and fast-I/O.

• (b) is a higher performance uni-processor with two memory-ports, one for
instructions and one for data (§ 3.3.2, 3.3.3). This system may have
separate instruction and data caches, as proposed in this chapter, but
simultaneous misses in both caches are possible and require a memory
port for each one of them. Such systems can be implemented with two
separate main memory modules, one for instructions and one for data.
However, it is also common to implement them with a single shared
module, so that all available memory can be used regardless of whether
there is little code and much data or vice versa.

• (c) is one possible multi-processor configuration. Here, the need for
multiple memory ports is very high. Accessibility to the memory is, in fact,
the limiting factor in system expandability.

• (d) is a possible organization of one node in a future multi-processor
system. It consists of a uni-processor like (a) or (b), connected to other
processors and I/O devices via a high-bandwidth network of interconnected
communications components [Fuji83]. A separate supervisor CPU may
exist, to take care of communications overhead, paging, interrupt
handling, etc. We consider this organization as more attractive than (c),
because it is expandable in a uniform manner.


Figure 6.5.1: Need for Multi-Port Memories.


Today, multi-port memories are made by providing time-shared access to a
single physically available memory port. This is because true multi-port devices
are prohibitively expensive for large storage systems, but also because most
systems are of type (a) above with fast-I/O devices that have an average
bandwidth that is still significantly lower than that of the CPU.

When multiple, asynchronous access requests are made to a single, shared
memory port, contention will arise. Whenever simultaneous requests occur, all
but one of the requesting devices will have to wait. For example, a RISC CPU
uses the memory during all cycles; thus, it has to wait when an I/O access
occurs. In a VAX-11/780 system, the processor is slowed by roughly 4% for each
Mbyte/sec of average I/O traffic when its memory is 2-way interleaved, or by
roughly 10% in the non-interleaved option [DEC81].
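As a rough illustration of these figures, the slowdown can be estimated with a linear model. The linearity is an assumption that can only hold for light I/O loads, and the function name is hypothetical; the two rates are the ones quoted above for the VAX-11/780.

```c
/* Percent CPU slowdown per Mbyte/sec of average I/O traffic (VAX-11/780
   figures quoted in the text). */
#define SLOWDOWN_PCT_INTERLEAVED      4.0
#define SLOWDOWN_PCT_NON_INTERLEAVED 10.0

/* Estimated CPU slowdown in percent, assuming it grows linearly with the
   average I/O traffic (a simplification, reasonable only for light loads). */
double cpu_slowdown_percent(double io_mbytes_per_sec, int two_way_interleaved)
{
    double rate = two_way_interleaved ? SLOWDOWN_PCT_INTERLEAVED
                                      : SLOWDOWN_PCT_NON_INTERLEAVED;
    return rate * io_mbytes_per_sec;
}
```

With an average fast-I/O traffic of 0.5 Mbyte/sec, for instance, the model gives a slowdown of about 2% with interleaving and about 5% without.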

If the average bandwidth of the fast-I/O devices is significantly lower than
that of the CPU (and hence than that of the memory), then memory
contention will only have a minor negative effect on CPU performance. The following
table gives a feeling for the bandwidth requirements of the various devices that
may be connected to a memory system in organizations like the ones in fig.
6.5.1. Since real values vary widely, the table only gives some typical examples.


Example of Memory Access Characteristics by Various Devices.

Device                      Size of access   Average Bandwidth   Peak Bandwidth
                            (Bytes)          (Mbytes/sec)        (Mbytes/sec)

CPU                         4                10.                 12.
Cache                       32               4.                  20.
Slow I/O (Terminal, etc)    1                0.0001              0.001
Fast I/O (Disk, LAN)        2000             0.3                 1.

Thus, in typical systems of today, the average bandwidth of their few fast-I/O
devices represents only about one tenth of the CPU bandwidth. This is a
tolerable load which does not degrade CPU performance too much, when the CPU and
I/O share a single memory port with a cycle-time approximately equal to the
CPU cycle-time.

This situation, however, will probably change in the future. In our view,
systems will evolve towards organization (d), since expandable multi-processors will
be desirable and feasible. The communications bandwidth will increase, both
because of advances in technology, and because of the need for more
closely-coupled parallel execution. This bandwidth increase will place a heavy load on
the memory system. It is important to notice, however, that, whereas the CPU
performs random accesses to the memory, a fast I/O device or a
communications network performs serial accesses most of the time, since large pieces of
consecutive memory (e.g. pages) are copied in or out of the memory. The
internal organization of dynamic RAM chips is such that it allows the inexpensive
addition of a second serial-access port to them, thus solving the CPU-I/O
contention problem.

6.5.2  DRAM  Chips  With  Secondary  Serial-Access  Port. 

The top part of figure 6.5.2 shows a typical internal organization of a
dynamic RAM chip. Every time an access is made, the whole row along the
activated word-line is read out of the storage array into the sense amplifiers.
Then, a single bit is selected by the column address. Thus, there is a huge
difference of about 3 orders of magnitude in the bandwidth of on-chip and
off-chip reading operations. In a memory system, the chips' row-address is
typically the physical page number. Thus, on every access, one whole memory page
is read into the sense amplifiers of the memory chips. For example, if a 1 Mbyte
memory consists of 32 256-Kbit chips like the one shown in fig. 6.5.2, then, on
every memory access, 32 rows of 1-Kbit each are read in all the chips,
amounting to a total of one 32-Kbit, i.e. 4-Kbyte, page.

Figure 6.5.2: DRAM Chip with Secondary Serial Port.
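The arithmetic of the 1-Mbyte example can be checked with a few lines of C. The sketch assumes that every chip contributes one bit to each memory word, which matches the single-bit column selection described above; the structure and function names are hypothetical.

```c
/* Parameters of a memory built from identical DRAM chips, each assumed to
   supply one bit per word. */
typedef struct {
    unsigned long chips;          /* number of chips accessed in parallel */
    unsigned long bits_per_chip;  /* capacity of one chip, in bits        */
    unsigned long bits_per_row;   /* bits read into the sense amplifiers  */
} DramSystem;

/* Total capacity of the memory, in bytes. */
unsigned long memory_bytes(const DramSystem *m)
{
    return m->chips * m->bits_per_chip / 8;
}

/* Bytes read into the sense amplifiers of all chips on one access. */
unsigned long page_bytes(const DramSystem *m)
{
    return m->chips * m->bits_per_row / 8;
}

/* Number of rows per chip, i.e. number of pages in the memory. */
unsigned long rows_per_chip(const DramSystem *m)
{
    return m->bits_per_chip / m->bits_per_row;
}
```

For 32 chips of 256 Kbit with 1-Kbit rows, this gives a 1-Mbyte memory organized as 256 pages of 4 Kbytes each.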


Even though a whole page may be read inside the DRAM chips, only a single
word is selected out of it and transmitted off-chip. Many recent DRAM chips
offer a nibble or page mode of access. Under the former mode, four adjacent
bits (words) out of the row (page) are serially transmitted off-chip, with a very
short delay between subsequent bits. Under the latter mode, any subsequent
access to the same row (page) can be made with a shorter delay than the first
one, by just supplying a new column-address to the chips. These modes of
access make some of the high on-chip bandwidth available to off-chip accesses.
They can be used to speed up cache-block accesses and sequential I/O as long as
it is uninterrupted by references to other pages. The applicability of these
modes to CPU accesses is limited because the CPU does not usually ask for
adjacent words or for words from the same page on consecutive memory
transactions.


The lower part of figure 6.5.2 shows the proposed addition to a conventional
DRAM chip. It consists of a shift-register of size equal to one row. It has parallel
connections to the bit lines of the storage array and a serial connection to one
pin of the chip. One additional chip pin is used as its independent shift clock.
This proposed "secondary-port register" can be loaded in one memory cycle with
the contents of the row which is being read during that cycle. On the scale of
the entire memory system, this corresponds to copying an entire page into the
secondary-port registers of the memory chips. Once this page transfer has been
done, the parallel connections between the secondary-port registers and the
bit-lines may be disabled, and their serial off-chip ports may be enabled. These
now provide a second, totally independent, serial-access memory port, which
even has its own (asynchronous) clock. Through this "secondary serial-access
port", the entire page can be transferred to the I/O device at the latter's own
rate of transfer. This rate can be higher than the primary-port rate, because
the shift-register is faster than the dynamic storage array. Of course, the
inverse of the above scenario can be followed for transfers from the I/O device
to the secondary-port register and finally into an arbitrary memory page.
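The behaviour of the proposed chip can be mimicked by a small software model. This is only a functional sketch with hypothetical names and sizes (a 256-row by 1024-bit array, one byte per stored bit); it shows the one-cycle row copy and the subsequent independent serial read-out, not the circuit or its timing.

```c
#include <stdint.h>
#include <string.h>

#define NUM_ROWS 256u
#define ROW_BITS 1024u

typedef struct {
    uint8_t  array[NUM_ROWS][ROW_BITS]; /* storage array, one byte per bit */
    uint8_t  port_reg[ROW_BITS];        /* secondary-port shift register   */
    unsigned shift_pos;                 /* next bit to be shifted out      */
} DramChip;

/* Primary (random-access) port. */
void primary_write(DramChip *chip, unsigned row, unsigned col, uint8_t bit)
{
    chip->array[row][col] = bit & 1u;
}

uint8_t primary_read(const DramChip *chip, unsigned row, unsigned col)
{
    return chip->array[row][col];
}

/* Copy a whole row into the secondary-port register: one memory cycle. */
void load_port_register(DramChip *chip, unsigned row)
{
    memcpy(chip->port_reg, chip->array[row], ROW_BITS);
    chip->shift_pos = 0;
}

/* One tick of the independent shift clock: returns the next bit of the
   captured row, or -1 once the whole row has been shifted out. */
int shift_out_bit(DramChip *chip)
{
    if (chip->shift_pos >= ROW_BITS)
        return -1;
    return chip->port_reg[chip->shift_pos++];
}
```

After load_port_register(), later primary-port writes to the same row do not affect the bits delivered by shift_out_bit(), which is the independence of the two ports described above.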


The cost of this secondary memory-chip port is roughly similar to that of
the sense-amplifiers in terms of silicon area: about 1/10th of the chip. In terms
of pins, 3 additional ones are needed - one for the serial data, one for the serial
clock, and one for controlling the mode of page transfer †. A hidden cost of this
system arises from the fact that it requires the page-number to be the
row-address instead of the column-address. The row-address is required to be
available to the chip at the very beginning of the memory access. Thus,
virtual-to-physical address translation cannot be performed in parallel with row decoding
and bit-line sensing, as is possible when the page-number is used as column
address.

† One additional serial data pin is required if simultaneous, synchronous input
and output into/from the secondary-port register is desired.

In conclusion, the proposed secondary serial-access memory port will be
advantageous for memories in systems with high I/O bandwidth, such that the
CPU-I/O contention for memory cycles results in a heavier penalty than the
address translation delay that the new memory organization may incur. In
general, small additions or reorganizations of hardware may lead to changes in
system architecture that can result in large gains in performance. In all cases, it is
important to first identify the actual bottleneck areas so that the added
hardware can be as effective as possible.


CHAPTER 7:

CONCLUSIONS. 


Single-chip implementation of a general-purpose von Neumann processor
offers advantages of low cost and of high performance owing to the high
bandwidth of on-chip communications. However, even in Very Large Scale
Integrated (VLSI) circuits, the limited number of transistors that are available on
a single chip constitutes a scarce resource when used to implement such a CPU.
In this dissertation, the efficient use of these silicon resources was studied,
showing that a Reduced Instruction Set Computer (RISC) architecture is
advantageous, because it allows the integration of units providing fast access to
operands and instructions, while still supporting the high-performance execution
of the simple operations required during most of general-purpose computations.

In chapter 2, the nature of general-purpose von Neumann computations was
studied. Even programs that are heavily oriented to numerical floating-point
computations execute an equally high number of array references, simple index
arithmetic, and loop control instructions. In non-numerical programs,
floating-point computations are replaced by copying or compare-and-branch
instructions, and array references are often replaced by indirect accesses
through pointers. This shows the crucial importance of fast operand accessing,
simple address arithmetic, and fast compare-and-branch execution. The high
percentage of references to local scalar variables (≈ 60% of all variable
references) and the fact that most procedures have few such variables (a dozen
or less) make hardware support for fast operand accessing mandatory. The narrow
range (≈ ±3) of dynamic procedure nesting depth fluctuations for extended
periods of time makes a small set of overlapping register windows a viable
approach.

The  above  findings  led  to  the  formulation  of  the  Reduced  Instruction  Set 
Computer  (RISC)  concept  and  to  the  definition  of  the  Berkeley  RISC  architec¬ 
ture,  which  became  the  basis  of  a  large  group  project.  Within  this  project,  a 
second  NMOS  implementation  of  the  Berkeley  RISC  processor,  called  RISC  II,  was 
designed, laid out, debugged, and successfully tested after fabrication, in
collaboration with Robert Sherburne.

The Berkeley RISC architecture, as described in chapter 3, is
register-to-register oriented. It has simple instructions, a simple and
orthogonal instruction-format, and a regular timing model that fits all
instructions. Fast operand accesses are supported by a large register-file with
multiple overlapping windows, where scalar arguments and local variables of
procedures are allocated by default until the window is filled. Pipelining is
utilized to keep the operational units busy executing the primitive operations
selected by the compiler and optimizer. The combined effect of the 138
registers, organized in 8 windows, and of the 3-stage pipeline yields a machine
of significantly higher performance than other commercial processors built in
comparable technologies but with complex instructions and non-optimal use of
registers. The size of programs compiled for RISC is not much larger than it is
when compiled for other architectures, even though only simple instructions
with a simple but code-inefficient format are available to the compiler.


The change away from the traditional trend towards instruction sets of
increasing complexity resulted in a radical change in the allocation of the chip
area to the various CPU functions. As shown in chapter 4, the control section of
RISC II occupies only 10% of the area, in contrast with other processors with
complex instruction sets, where the control and micro-program ROM occupy
more than half the chip area. Scarce silicon resources are thus freed and used
more effectively for the implementation of the large register file. The
simplicity of the processor significantly contributes to its speed by reducing
the number of gate-delays in the critical path and also by reducing the physical
size, and thus the parasitic capacitances, of the circuit elements. Designing,
laying out, and debugging the simple RISC II processor required about five times
less human effort than is usual for other microprocessors. Chips were
functionally correct and ran at the predicted speed on first silicon due to
careful simulation and modeling, as described in chapter 5.

Future VLSI technology will allow the integration of larger systems on a
single chip. Beyond multi-window register-files, such technologies will also
allow on-chip support for fast access to other important elements of
computations: instructions, and non-scalar operands. Chapter 6 proposed suitable
organizations for such units. Remote-PC instruction-fetch-&-sequence units keep
the program-counter and its associated logic close to the place where it is
being used and provide reduced communication costs and more parallelism in
executing jumps. When combined with a dual-port instruction-cache, they allow
single-cycle compare-&-branch instructions and transparent unconditional
branches, both of which are frequent instructions in general-purpose
computations (about 1 out of 4 instructions). Instruction-caches also allow the
CPU to effectively see two independent memory ports and make it possible to
access data in memory while the next instruction is being fetched. Further
inclusion of a data-cache allows address-computation to partially overlap data
access, thus making possible single-cycle data-memory-access instructions.

For the foreseeable future, Reduced Instruction Set Computer Architectures
appear to be the most effective way to use the scarce chip resources to
support the crucial need of general-purpose computations for fast access to
operands and instructions.

APPENDIX A:

DETAILED DESCRIPTION
OF THE
RISC II ARCHITECTURE


This  appendix  describes  in  detail  the  architecture  of  the  RISC  II  CPU  NMOS 
chip.  It  is  a  complement  to  chapter  3. 

Some of the minor details of the RISC II architecture are "by-products" of
its implementation. Most of these are points that were left unspecified in the
original architecture. One important exception was mentioned in § 3.1.2.
Register-indexed store instructions can only have an immediate source-2 -- not a
register rs2 (fig. A.4.4). This restriction was imposed for implementation
reasons. For similar reasons, this chip does not implement the
pointer-to-register scheme (§ 3.2.3). On the other hand, it has some additional
features: (1) conditional returns, and (2) compatibility with Expanding
Instruction Caches [Patt83].

In this appendix the RISC II architecture is described from various points of
view:

•  definition of the user-visible CPU state (§ A.1);

•  description of the CPU interface to the outside world (memory, I/O,
   interrupts) (§ A.2);

•  discussion of the way in which instructions are sequenced (§ A.3);

•  description of the instruction set, the instruction format, and the effect
   that the execution of each instruction has on the user-visible state of the
   CPU and on the CPU interface to the outside world (§ A.4);

•  specification of the actions taken on interrupts/traps (§ A.5).


A.1  User-Visible  State  of  the  CPU. 

The User-Visible State (u-v state) of the CPU is the state of the processor
chip that remains after execution of an instruction has completed, and before
execution of the next instruction begins. The "future" of the processor depends,
at that point, only on that state. The above "point in time" may not be well
defined in the implementation, because of pipelining. However, the RISC
architecture precisely defines the state of the CPU at that point, since it
considers instruction execution to be indivisible and entirely sequential. The
implementation must guarantee that the overall result of executing an arbitrary
program on the real hardware will be the same as predicted by the architecture
for purely sequential execution of the same program.

Here, we must draw a distinction between the "normal user" and the
"interrupt-handler programmer" (i-h programmer) †. The "normal user" does
not see the interrupts; what (s)he considers one instruction may in fact be an
instruction that was aborted and then restarted after an interrupt handler
completed its task. Such a user has his/her u-v state. The "interrupt-handler
programmer" has a finer-grain definition of what an instruction is; for him/her
interrupts and aborted instructions are clearly visible. Thus, (s)he needs to
have additional visible state for the interrupt-handler to work with, while the
normal user's visible state is left unaltered.

† In this section the word interrupt is used to signify both interrupts and
traps.

Figure A.1.1 shows the RISC II User-Visible State. The register file has 138
32-bit registers, accessible by an addressing pair: ( WindowNumber .
RegisterNumber ). The WindowNumber ranges from 0 to 7 and is always provided by
CWP=PSW<12:10>. The RegisterNumber is provided by the individual instruction
accessing the register; it ranges from 31 to 0, with 31 to 26 overlapping with
the parent (caller), 25 to 16 being the locals, 15 to 10 overlapping with the
child (callee), and 9 to 0 being the globals. Register-0 always has the
(hard-wired) value 0; writing into it is allowed, but has no effect whatsoever.
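
The addressing pair described above can be illustrated with a minimal Python sketch. The function name `physical_register` and the particular physical numbering (globals first, then 16 registers contributed by each window) are assumptions made for illustration; the text only fixes the register ranges, the overlap between neighbouring windows, and the total of 138 registers. The sketch also assumes that the caller of window w is window w+1 modulo 8, in line with calls decrementing CWP (fig. A.4.6).

```python
NUM_WINDOWS = 8
GLOBALS = 10           # registers 0..9 are shared by all windows
REGS_PER_WINDOW = 16   # 10 locals + 6 overlap registers contributed per window


def physical_register(window: int, reg: int) -> int:
    """Map (WindowNumber, RegisterNumber) to a physical index in 0..137.

    The numbering is hypothetical; only the overlap structure follows the text.
    """
    if not (0 <= window < NUM_WINDOWS and 0 <= reg <= 31):
        raise ValueError("window must be 0..7 and register must be 0..31")
    if reg < GLOBALS:
        return reg  # globals: same physical register in every window
    # Registers 10..31 of window w occupy 22 consecutive slots starting at
    # w * 16, wrapping around, so the top 6 (26..31) coincide with
    # registers 10..15 of window w + 1 (the caller's window).
    offset = (window * REGS_PER_WINDOW + (reg - GLOBALS)) % (
        NUM_WINDOWS * REGS_PER_WINDOW
    )
    return GLOBALS + offset


if __name__ == "__main__":
    distinct = {physical_register(w, r) for w in range(8) for r in range(32)}
    print(len(distinct))  # number of distinct physical registers
```

With this layout, registers 26 to 31 of a window and registers 10 to 15 of its caller name the same physical cells, which is how arguments are passed without copying.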

The  PSW  has  13  bits  containing  various  information.  The  CWP  specifies  the 
current  window,  and  SWP  is  the  limit-value  for  detecting  register-file 
over/under-flows.  The  fact  that  they  are  shown  to  point  to  register-16  has  no 
particular  significance  for  the  RISC  II  CPU;  this  is  in  analogy  to  the  pointer-to- 
register  scheme  (§  3.2.3).  The  I  bit  enables/disables  interrupts.  The  S  bit  shows 
the  current  CPU  privileges.  The  P  bit  is  a  place  to  save  the  previous  value  of  the 
S  bit  when  the  latter  is  altered  (set)  on  interrupts.  The  four  remaining  bits  of 
PSW  are  the  condition  codes. 

The SWP and the P bit are parts of the state visible by the i-h programmer
only. Since register-file over/underflows are taken care of by an
interrupt-handler, the "normal user" is given the illusion of a register-file
with infinitely many windows ‡. The i-h programmer uses the SWP to detect
over/underflows of the real register-file, the necessary window save/restore
operations are carried out, and subsequently the value of the SWP is changed to
reflect that fact. If the normal user reads SWP, (s)he may find "random values"
in it (i.e. values that do not depend on his/her program alone). Similarly, the
P bit is used to save a part of the normal-user-visible state before the latter
is altered on an instruction abortion, so that it can be restored later; the
normal user is given the illusion of an uninterrupted instruction execution.

‡ If we wanted to be very precise, we should have shown that in fig. A.1.1,
together with a CWP whose existence, but not its value, is known to the normal
user.

Figure A.1.1: User-Visible State of RISC II CPU.

The  definition  of  the  remaining  four  words  of  u-v  state  is  more  difficult  and 
does  not  completely  correspond  to  the  implementation.  The  reason  is  that  their 
values  are  of  a  dynamic  nature.  They  are  implicitly  and  automatically  changed 
by the execution of every instruction, and thus cannot be preserved and read
by  some  other  instruction  at  a  later  time.  However,  they  do  belong  to  the  u-v 
state,  since  they  depend  on  the  previous  instruction(s),  and  they  determine  the 
future  activities  of  the  processor. 

At  the  (conceptual)  moment  when  an  instruction  completes  execution  and 
the  next  one  has  not  yet  started  executing,  the  Instruction  Register  contains  the 
instruction  to  be  executed  next.  The  PC  contains  the  address  of  that  next-to- 
be-executed  instruction.  This  information  would  in  general  be  redundant  since 
the  instruction  is  already  known,  but  it  is  needed  for  PC-relative  addressing. 

At  the  end  of  an  instruction  the  u-v  state  contains  the  address  of  the  next 
instruction  and  also  the  instruction  itself.  This  is  a  consequence  of  the  delayed- 
branch  scheme,  which  makes  the  instruction-fetch/execute  pipeline  visible  to 
the  user  (§  3.1.3).  This  also  explains  the  existence  of  NXTPC  in  the  u-v  state. 
NXTPC  is  the  address  of  the  instruction  to  be  executed  subsequently  after  the 
one  contained  in  the  Instruction  Register.  It  plays  the  role  of  PC  in  other  com¬ 
puters.  NXTPC  is  determined  by  the  instruction  that  just  finished  executing,  and 
not  by  the  instruction  that  will  execute  next,  in  accordance  with  the  delayed- 
jump  scheme.  The  LSTPC  register  is  part  of  the  i-h  programmer  visible  state 
only.  It  contains  the  address  of  the  instruction  which  just  finished  executing. 
Its purpose is to hold the value of the PC when an instruction is aborted due to
an interrupt. The PC, in turn, holds the value of the NXTPC, while NXTPC is used
for  fetching  the  first  instruction  of  the  interrupt-handler.  The  three  PC's  always 
have  a  0  least  significant  bit,  since  RISC  II  instructions  are  always  half-word 
aligned  in  main-memory. 

The  u-v  state  remains  unaltered  during  the  "first  part"  of  the  execution 
cycle,  when  the  processor  reads  some  parts  of  it  in  order  to  compute  some  new 
values  to  be  written  back  into  the  state  during  the  "last  part"  of  the  execution 
cycle.  In  particular,  PC-relative  instructions  will  use  the  value  that  the  PC  has 
at  the  beginning  of  the  execution  cycle. 
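
The roles of LSTPC, PC, and NXTPC described in this section can be sketched as a small state machine. This is only an illustration under stated assumptions: the names `PCState` and `step` are hypothetical, interrupts are not modeled, and the increment of 2 or 4 is taken from the length of the instruction that was just fetched from NXTPC (4 when no expanding cache is present, see § A.2).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PCState:
    lstpc: int   # address of the instruction that just finished executing
    pc: int      # address of the instruction about to execute
    nxtpc: int   # address of the instruction to execute after that one


def step(state: PCState, jump_target: Optional[int] = None,
         fetched_length: int = 4) -> PCState:
    """Advance past the instruction at state.pc.

    jump_target is the effective address when that instruction is a taken
    control transfer, otherwise None. The instruction at state.nxtpc still
    executes next in either case (delayed jump).
    """
    if fetched_length not in (2, 4):
        raise ValueError("instruction length must be 2 or 4 bytes")
    if jump_target is not None:
        new_nxtpc = jump_target
    else:
        new_nxtpc = state.nxtpc + fetched_length
    return PCState(lstpc=state.pc, pc=state.nxtpc, nxtpc=new_nxtpc)


if __name__ == "__main__":
    # Addresses follow the example of fig. A.4.5: a jmpr at 104 targets PC+100.
    s = PCState(lstpc=96, pc=100, nxtpc=104)
    s = step(s)                            # executes the instruction at 100
    s = step(s, jump_target=s.pc + 100)    # executes the jmpr at 104
    print(s)
```

The PC-relative target is computed from the value PC has at the beginning of the execution cycle, which is why `s.pc + 100` is evaluated before `step` replaces the state.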


A.2  Interface  between  CPU  and  Outside  World. 

The RISC II CPU communicates with the outside world by read and write
accesses from/into a 4Gbyte virtual address-space. Thus, all I/O and
System-Control functions are memory-mapped. Figure A.2.1 shows the signals that
come in and out of the CPU.

For  each  memory  access,  the  CPU  issues  an  address  and  specifies  the 
access  type  by  the  Read/Write  signal.  It  also  issues  some  additional  informa¬ 
tion,  which  the  peripheral  devices  may  or  may  not  use:  (i)  whether  the  processor 
is  currently  in  user  or  system  mode,  and  (ii)  whether  the  CPU  interprets  the 
accessed  item  as  instruction  or  as  data  (this  signal  is  always  "data"  for  write 
accesses).  Addresses  are  virtual,  32  bit  wide,  and  refer  to  bytes  in  memory. 


[Figure A.2.1 diagram. Signal groups between the CPU and the outside world:
  CPU CONTROL: Clock Phases, Reset, Interrupt Req., Interrupt Acknowl.
  MEMORY ACCESS REQUEST (address and control): ADDRESS, WIDTHcodeW, WIDTHcodeH,
      Read/Write, System Mode, Instr/Data.
  INSTRUCTIONS/DATA FROM/TO MEMORY (shared): INSTR/DATA, Instr. Length.]

Figure A.2.1: CPU - Outside World Interface.

The RISC II architecture understands and supports byte data (8 bit wide),
half-word data (16 bit wide), and word data (32 bit wide). Figure A.2.2(a) shows
the possible alignments that those types may have in memory and the address
used to refer to each one of them (addresses shown are in decimal). Every item
must be aligned so that, if the whole memory were packed with items of the
same width and the same alignment, no item would cross word boundaries. The
address of an item is the address of the least-significant byte in it, with
addresses increasing towards more-significant bytes.
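
The alignment rule can be restated as a one-line check: an item never crosses a word boundary under the packing condition above exactly when its address is a multiple of its own width. The sketch below is illustrative only; the names `is_aligned` and `containing_word` are not taken from the original.

```python
WIDTH_BYTES = {"byte": 1, "half": 2, "word": 4}


def is_aligned(address: int, width: str) -> bool:
    """True if an item of the given width may legally start at this address."""
    return address % WIDTH_BYTES[width] == 0


def containing_word(address: int) -> int:
    """Address of the memory word holding the item (two LS bits taken as 0)."""
    return address & ~0x3


if __name__ == "__main__":
    for addr in range(4):
        print(addr, [w for w in WIDTH_BYTES if is_aligned(addr, w)])
```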

The memory of a RISC II system is organized in words, having, however,
separate write-enable controls for the four byte-wide fields (byte-banks) in it,
so that it can selectively write some but not all of the bytes in a word. Read
accesses always read a whole word. The two least-significant address bits are
discarded (considered to be zero) and the corresponding word is read and given
to the CPU. Further selection and alignment of narrower data items is
performed inside the CPU. For write accesses, the CPU always outputs a full
32-bit word onto the bus. However, only some of the bytes in that word are to be
written into the corresponding bytes of the addressed word. Figure A.2.3 shows
an example. The target word is determined as before by considering the 2
LS-bits of the address to be zero. The particular bytes to be written are
determined by the two original least-significant address bits, and by the
"width-code bits" (fig. A.2.1), according to the table in figure A.2.3. The
width-code bits indicate the type of item to be written. They could be encoded
in one bit, but the RISC II chip leaves them unencoded. Bit<i> of the target
memory word is written with bit<i> of the word output on the bus (if the
corresponding byte-bank is enabled); in other words, the memory does not need
to perform any alignment.
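
A write access as described above can be sketched in Python as follows. The table of figure A.2.3 is not reproduced in this text, so the byte-bank selection below is an assumption derived from the stated convention that the address of an item is the address of its least-significant byte; the names `store_lanes` and `memory_write` are hypothetical, and the sketch assumes that unused byte positions on the bus are zero.

```python
def store_lanes(address: int, width_bytes: int, value: int):
    """Return (word_address, byte_enables, bus_word) for a store.

    byte_enables[i] is True when byte-bank i (bits 8i+7..8i of the word)
    is to be written. The CPU, not the memory, shifts the item into place.
    """
    if width_bytes not in (1, 2, 4) or address % width_bytes != 0:
        raise ValueError("bad width or misaligned address")
    offset = address & 0x3                 # two LS address bits
    word_address = address & ~0x3          # target word
    item_mask = (1 << (8 * width_bytes)) - 1
    bus_word = ((value & item_mask) << (8 * offset)) & 0xFFFFFFFF
    byte_enables = tuple(offset <= i < offset + width_bytes for i in range(4))
    return word_address, byte_enables, bus_word


def memory_write(old_word: int, byte_enables, bus_word: int) -> int:
    """Memory side: copy bit<i> of the bus into bit<i> of each enabled bank."""
    result = old_word
    for bank, enabled in enumerate(byte_enables):
        if enabled:
            lane = 0xFF << (8 * bank)
            result = (result & ~lane) | (bus_word & lane)
    return result & 0xFFFFFFFF


if __name__ == "__main__":
    word_address, enables, bus = store_lanes(0x1002, 2, 0xABCD)
    print(hex(word_address), enables, hex(bus))
    print(hex(memory_write(0x11223344, enables, bus)))
```

The split into two functions mirrors the division of labour in the text: alignment happens inside the CPU, and the memory only gates whole byte-banks.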


Instruction fetches need not be different from data reads in a simple RISC II
system. All instructions that the RISC II CPU understands are 32-bit wide. In
fact, the RISC II CPU chip uses the same bus and timing for instruction and data
fetches. However, the RISC II CPU is also compatible with Expanding Instruction
Caches [Patt83]. These are instruction caches that receive from memory both
16- and 32-bit wide instructions. They "expand" the 16-bit ones into equivalent
32-bit instructions, which they subsequently supply to the CPU. All this is
transparent for the CPU, which only sees 32-bit instructions, except for the
fact that NXTPC has to follow the real memory address of the instructions, i.e.
it has to get incremented by 2 or by 4, according to the length of the last
fetched instruction. This is the purpose of the instruction-length signal going
into the CPU (fig. A.2.1). In the absence of an expanding cache, this signal
must be permanently set to "4 bytes". In the presence of such a cache, it is
the cache that provides it.


Finally, the "outside world" has some control over the CPU by means of the
clock phases and the reset- and interrupt-request signals. By means of the
clock phases, the CPU can be slowed down, or even stopped temporarily, which
is useful, for example, during the handling of a cache miss. A single interrupt
request signal is provided. It is assumed that interrupt prioritizing is done
outside the CPU. The reset signal acts as a non-maskable interrupt.


A.3  Instruction  Execution  Sequencing. 

RISC  II  sequences  instruction  execution  in  the  usual  von  Neumann  way, 
except  for  the  visibility  of  the  fetch/execute  overlap  which  results  in  the 
delayed  jump  scheme  (§  3.1.3),  and  for  the  compatibility  with  Expanding 
Instruction-Caches (§ A.2). Figure A.4.5 describes control-transfer instructions.

The instruction-fetch process operates in parallel with the
instruction-execute process and relatively independently from it. Out of the
user-visible state, the fetch process uses the PC's, and the execute process
uses the register file and PSW. The two processes communicate via the
instruction register in the "forward direction" and via the control-transfer
instructions in the "backwards direction". When an interrupt (or trap) occurs,
progress of the execute process is aborted, and that process is inhibited from
altering its u-v state. However, the fetch process is allowed to proceed, after
an alteration is made to it so that it starts fetching instructions from the
interrupt handler routine. These points will be discussed further in § A.5.

A.4 Instruction Set.

Figure A.4.1 shows the two instruction formats of the RISC II CPU. Figure
A.4.2 shows the binary and symbolic opcodes of the 39 RISC II instructions. The
opcode uniquely determines whether the instruction has the short-immediate or
the long-immediate format, and figure A.4.2 shows that correspondence. The
opcode also uniquely determines the interpretation of the DEST field of the
instruction (fig. A.4.1(a)), and fig. A.4.2 shows that correspondence as well.
The format of the shortSOURCE2 field of the short-immediate instructions is
determined by its leading bit, and not by the opcode (fig. A.4.1(b)). Figure
A.4.2 also shows which instructions are privileged. An attempt to execute a
privileged instruction while the System-mode bit is OFF causes an immediate trap
which prevents the instruction from executing (§ A.5). A trap is also caused if
execution of an illegal instruction (one with unassigned opcode) is attempted
(notice that all opcodes with a leading bit equal to 1 are unassigned). RISC II
has the same 39 instructions as RISC I, but it has different opcodes.

The instruction set can be subdivided into five groups of instructions, which
are described in figures A.4.3 through A.4.8 and discussed in the next
paragraphs.

Register-to-register OP instructions: Figure A.4.3 shows all the instructions
of the second column of fig. A.4.2, except for ldhi. They include shift, logical,
and integer-arithmetic operations. They all have the short-immediate format.
They operate on register rs1 of the current window and on the source-2.
Source-2 may be register rs2 of the current window or the immediate constant
imm13 contained in the instruction. Registers are always interpreted as 32-bit
quantities (fig. A.2.2(e)), and imm13 is always sign-extended (fig. A.2.2(c)).
The result is written into register rd of the current window, and the
instruction may additionally update the Condition Codes. Remember that
register-0 always has the value 0 and that writing into it has no effect.

[Figure A.4.1: instruction formats (1. SHORT-IMMEDIATE FORMAT; 2. long-immediate
format).]

[Figure A.4.2 table. The high-order opcode bits select one of the groups
000xxxx, 001xxxx, 010xxxx, 011xxxx (all 1xxxxxx opcodes are unassigned); the
low-order bits (xxx0000 through xxx1111) select the entry within the group.
Legible entries: getpsw, getlpc, putpsw, callx, callr, jmpx, ret, reti, sub,
subc, subi, subci, ldxw, ldrw, ldxhu, ldrhu, ldxhs, ldrhs, ldxbu, ldrbu, ldxbs,
ldrbs.]

©: conditional instructions: DEST-field is cond (see fig. A.4.1(a)).
double boxes: long-immediate format instructions (fig. A.4.1(2)).
empty boxes: illegal opcodes.
calli, getlpc, putpsw, reti: privileged instructions.

Figure A.4.2: The RISC II opcodes.

[Figure A.4.3 diagram: rs1 (see fig. A.2.2(e), A.4.1(1)) and shortSOURCE2 feed
the operation OP as s1 and s2; the result d is written into rd (see fig.
A.2.2(e), A.4.1) and may set the CC's Z, N, V, C (fig. A.1.1).]

OP:
  SHIFT (sll, sra, srl): s1 is shifted by the amount s2<4:0>.
  LOGICAL (and, or, xor): 32-bit bitwise operations;
      d := s1 OP s2 (OP: AND, OR, or Exclusive-OR).
  ARITHMETIC (add, addc, sub, subc, subi, subci): 32-bit 2's-complement
      operations;
      subi, subci: d := s2 - s1 [- NOT[C]].

CC's: Updated iff the SCC-bit (instruction<24>) is ON, as follows:
  Z := [d==0]; N := d<31>;
  shift, logical instructions: V := 0; C := 0;
  arithmetic's: V := [32-bit 2's-complement overflow occurred];
  additions: C := carry<31>to<32> (assuming s1, s2: unsigned);
  subtractions: C := NOT[borrow<31>to<32>] (for s1, s2: unsigned).

Figure A.4.3: ALU and Shift Instructions.
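
The condition-code rules for arithmetic instructions can be sketched in Python. The function names `add32` and `sub32` are hypothetical. The definition used for subc (d := s1 - s2 - NOT[C]) is an assumption made by analogy with the subi/subci entry of figure A.4.3; the carry convention for subtractions (C := NOT[borrow]) is taken from the figure.

```python
MASK32 = 0xFFFFFFFF


def add32(s1: int, s2: int, carry_in: int = 0):
    """Return (d, Z, N, V, C) for a 32-bit addition (add, addc)."""
    s1 &= MASK32
    s2 &= MASK32
    total = s1 + s2 + carry_in
    d = total & MASK32
    c = (total >> 32) & 1                      # carry out of bit 31
    # Signed overflow: operands agree in sign, result differs from them.
    v = ((~(s1 ^ s2) & (s1 ^ d)) >> 31) & 1
    return d, int(d == 0), d >> 31, v, c


def sub32(s1: int, s2: int, carry_in: int = 1):
    """Return (d, Z, N, V, C) for s1 - s2 - NOT[carry_in] (sub, subc).

    Computed as s1 + NOT[s2] + carry_in, so the carry out equals NOT[borrow].
    For subi/subci, call sub32(s2, s1, ...) with the operands swapped.
    """
    return add32(s1, ~s2 & MASK32, carry_in)


if __name__ == "__main__":
    print(add32(0x7FFFFFFF, 1))   # signed overflow case
    print(sub32(3, 5))            # unsigned borrow case
```

Expressing subtraction as an addition of the complement is a common hardware idiom; it is used here because it makes the "C := NOT[borrow]" rule fall out of the carry computation directly.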

Load instructions: Figure A.4.4 shows all the instructions of the third
column of figure A.4.2. They perform read accesses from the virtual address
space. The effective address for the access is the sum of rs1 and shortSOURCE2
for register-indexed loads (which have the letter x in their symbolic opcodes),
or the sum of PC and imm19 for PC-relative loads (which have the letter r in
their symbolic opcodes). There are separate load instructions for words (w),
half-words (h), and bytes (b). For the two latter cases, there are instructions
for loading those quantities as signed (s) or unsigned (u) numbers. The
addressed word is first read from memory, as discussed in § A.2. Then, the
desired part of the word is extracted from it, right-aligned, sign-extended or
zero-filled, and written into rd. (The "desired part" of the word is defined by
the instruction and by the two least-significant bits of the effective address;
see fig. A.2.2(b).) The CC's may optionally be updated according to the value
placed into rd.
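
The extraction step of a load can be sketched as follows. The function name `load_extract` is hypothetical, and the byte numbering within a word (offset 0 is the least-significant byte) is taken from the addressing convention of § A.2 rather than from figure A.2.2(b), which is not reproduced here.

```python
MASK32 = 0xFFFFFFFF


def load_extract(word: int, address: int, width_bytes: int, signed: bool) -> int:
    """Extract, right-align, and extend an item from a 32-bit memory word.

    Returns the 32-bit value that would be written into rd.
    """
    if width_bytes not in (1, 2, 4) or address % width_bytes != 0:
        raise ValueError("bad width or misaligned address")
    offset = address & 0x3                     # two LS bits of the address
    bits = 8 * width_bytes
    field_mask = (1 << bits) - 1
    field = (word >> (8 * offset)) & field_mask    # right-align the item
    if signed and (field >> (bits - 1)) & 1:
        field |= MASK32 & ~field_mask              # sign-extend to 32 bits
    return field                                   # otherwise zero-filled


if __name__ == "__main__":
    print(hex(load_extract(0x1234F680, 1, 1, signed=True)))
    print(hex(load_extract(0x1234F680, 1, 1, signed=False)))
```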

Store instructions: Figure A.4.4 also shows all the instructions of the fourth
column of figure A.4.2. They perform write accesses into the virtual address
space. The effective address for the access is computed in a fashion similar to
that for loads, except that for register-indexed addressing, shortSOURCE2 must
be an immediate (§ 3.1.2). The CPU properly aligns the item to be stored,
according to its width specified by the instruction type and to the 2 LS-bits of
the effective address (fig. A.2.3). It also outputs information about the item's
width, so that the memory selectively writes only some of the four byte-banks
(§ A.2).

Control-Transfer Instructions: Figures A.4.5 through A.4.7 describe the
jump, subroutine call, and return instructions. The (conditional) jump and
(unconditional) call instructions


LOAD INSTRUCTIONS:
    eff-addr := rs1 + shortSOURCE2 (see fig. A.4.1),
                or PC + imm19 (see fig. A.2.2(c));
    rd := MEMORY[eff-addr], aligned and sign-ext./zero-filled to 32 bits
          (see fig. A.2.2(b)).
    Iff SCC-bit is ON:
        Z:=[d==0]; N:=d<31>; V:=0; C:=0.

STORE INSTRUCTIONS:
    eff-addr := rs1 + imm13, or PC + imm19;
    MEMORY[eff-addr] := rd, aligned (fig. A.2.3).
    ATTENTION!!!: Indexed-store instructions only work with IMMEDIATE-OFFSET!!
        Their IMM-bit (instr<13>) MUST be ON!!
        Otherwise, the effective-address is garbage!!
        (This is a restriction of the original RISC Architecture).
    ATTENTION!!!: Iff SCC-bit is ON (it should NOT!!):
        Z:=garbage; N:=garbage; V:=0; C:=0.

TEST ALIGNMENT !!:
    If bad (fig. A.2.2(a)):
        ABORT INSTRUCTION.
        TRAP to address: 60000000 Hexadec.

Figure A.4.4: Load and Store Instructions.


A.4 


jmpx,  callx, 
ret.  retl: 


Jmpr. 

callr: 


short 


(see  fig.  A. 4.1) 


imml9 


_  ATTENTION!!!: 

ZNVC  SCC-bit  MUST  be  OFF; 

-  otherwise:  Z.N.V.Cifgarbage, 

and  for  conditionals  (jmp/ret): 

eff-addr  :=  garbage  !!! 


Iff  condition  is  true 
(see  fig.  A.4.7). 


eff-addr.  \ 


NXTPC 


TEST  ALIGNMENT  !!: 

If  bad  (eff-addr<0>==l): 
ABORT  INSTRUCTION,  and 
TRAP  to  address: 
80000000  hexadecimal. 


Example: 


DELAYED  JUMP  SCHEME: 
(Result  of  Fetch/Execute  Overlap) 


NXTPC: 


MEMORY  ACTIVITY: 


100: 

ldrw  ... 

PC+200; 

204:  sub  .... 

104: 

jmpr  ... 

PC+100; 

208: 

108: 

add  .... 

•  ••• 

112: 

•  ••  • 

300:  data. 

•  ••• 

X 

104  ) 

<  108  X  204 

Fetch 

Load 

from  104 

from  300 

Fetch 
from  108 


Fetch 
from  204 


CPU  ACTIVITY: 


Execute 

ldrw 


Execute  Execute  Execute 

jmpr  add  sub 

time 


Figure  A.4.5:  Control  Transfer  -  Delayed  Jumps 
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Figure  A.4.6:  Control-Transfer  Instructions. 


Instructions: 

Effect  it  Notes: 

Jmpx.  jmpr: 

Iff  condition  is  true  (see  fig.A.4.7),  then  control  is  trans¬ 
ferred,  as  shown  in  fig.  A.4.5. 

callx,  callr: 

(1)  Transfer  Control  (see  fig.  A.4.5); 

(2)  CWP  :=  CWP-1  modulo  8  (change  window  -  fig.  A.  1.1). 

(3)  rd  :=  PC  (save  PC  into  destination-register); 

NOTES:  (a)  the  rsl  ( it  rs2)  register(s)  specified  in  the 

instruction,  are  read  from  the  OLD  window; 

(b)  the  PC  value  that  is  saved  is  the  PC  of  the 
call  instruction  itself; 

(c)  the  PC  is  saved  into  register  number  rd  of 
the  NEW  window*. 

(d)  if  the  change  of  CWP  would  result  in  a  new 
value  that  would  be  equal  to  SWP  (fig.  A.  1.1), 
then  the  call  instruction  is  ABORTED,  and  the 
processor  TRAPS  to  address  80000020  Hexadec. 

(if  PSW_I  is  ON)  (Reg-File  Overflow  occurred). 

ret: 

Iff  condition  is  true  (see  fig.  A.4.7),  then: 

(1)  Transfer  Control  (see  Fig.  A.4.5); 

(2)  CWP  :=  CWP+1  modulo  8  (change  window  -  fig.  A.1.1). 
NOTES:  (a)  the  rsl  ( it  rs2)  register(s)  specified  in  the 

instruction,  are  read  from  the  OLD  window; 

(b)  the  normal  use  of  this  instruction  is  with 
target  addr.  rsl+B  (with  rsl=rd  of  the  call). 

(c)  if  the  condition  is  true,  and  if  the 
change  of  CWP  would  result  in  a  new 

value  that  would  be  equal  to  SWP  (fig.  A.1.1), 
then  the  return  instr.  is  ABORTED,  and  the 
processor  TRAPS  to  address  B0000030  Hexadec. 

(if  PSW-I  is  ON)  (Reg-File  Underflow  occurred). 

retL* 

Iff  condition  is  true  (see  fig.  A.4.7),  then: 

(1)  Transfer  Control  (see  fig.  A.4.5); 

(2)  CWP  :=  CWP+1  modulo  8  (change  window  -  fig.  A.1.1). 

(3) Modify PSW: I:=ON (enable interrupts); S:=P .

NOTES:  Same  as  for  ret. 
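The window-pointer rules of Figure A.4.6 can be sketched as a small model. This is only an illustration under stated assumptions: the function names are hypothetical, and the behaviour when PSW_I is OFF (the window change simply proceeds) is inferred from the remark in § A.5 that the calli decrement always occurs with interrupts disabled.

```python
NUM_WINDOWS = 8                 # CWP arithmetic is modulo 8 (Figure A.4.6)
OVERFLOW_VECTOR = 0x80000020    # Reg-File Overflow trap address
UNDERFLOW_VECTOR = 0x80000030   # Reg-File Underflow trap address


def call_window(cwp, swp, psw_i_on=True):
    """Model the CWP step of callx/callr.

    Returns (cwp_after, trap_vector); trap_vector is None if no trap occurs.
    """
    new_cwp = (cwp - 1) % NUM_WINDOWS
    if new_cwp == swp and psw_i_on:
        # The call is aborted: CWP keeps its old value and the CPU traps.
        return cwp, OVERFLOW_VECTOR
    return new_cwp, None


def ret_window(cwp, swp, psw_i_on=True):
    """Model the CWP step of ret/reti when the jump condition is true."""
    new_cwp = (cwp + 1) % NUM_WINDOWS
    if new_cwp == swp and psw_i_on:
        # The return is aborted: CWP keeps its old value and the CPU traps.
        return cwp, UNDERFLOW_VECTOR
    return new_cwp, None
```

For example, `call_window(0, 5)` wraps around to window 7, while `call_window(6, 5)` leaves CWP at 6 and reports the overflow vector.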

compute their effective target address like the load instructions do (register-
indexed or PC-relative). Only a register-indexed version of the return instruction
is provided, because that's all that is needed. There is a return-from-interrupt
instruction which will be discussed in § A.5. Return instructions are
conditional. Figure A.4.8 contains the details of the window manipulation for call


Figure  A.4.7:  The  RISC  II  Jump  Conditions. 


CODE   SYMBOL   NAME                            MEANING

0001   gt       greater than (cmp signed)       NOT[ (N ⊕ V) + Z ]
0010   le       less or equal (cmp signed)      (N ⊕ V) + Z
0011   ge       greater or equal (cmp sign.)    NOT( N ⊕ V )
0100   lt       less than (cmp signed)          N ⊕ V
0101   hi       higher than (cmp unsigned)      NOT[ NOT(C) + Z ]
0110   los      lower or same (cmp unsign.)     NOT(C) + Z
0111   lo       lower than (cmp unsigned)       NOT(C)
       nc       no carry
1000   his      higher or same (cmp uns.)       C
       c        carry
1001   pl       plus (tst signed)               NOT(N)
1010   mi       minus (tst signed)              N
1011   ne       not equal                       NOT(Z)
1100   eq       equal                           Z
1101   nv       no overflow (signed arithm.)    NOT(V)
1110   v        overflow (signed arithmetic)    V
1111   alw      always                          1

CODE: This is the "cond"-field (instruction<22:19>) (see fig. A.4.1(a)).
SYMBOL: This is how the condition is represented in Assembly.
MEANING: The condition is true if and only if
the value of this function of PSW<3:0> is 1.
⊕: Exclusive-OR.
NOT: Complement.


and return instructions (§ 3.2.2). Attention is drawn to the fact that the call
instructions read their source register(s) from the old window, whereas they
write their destination into the new one. Figure A.4.7 shows the branch
conditions employed by conditional-transfer instructions.
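The jump conditions of Figure A.4.7 are simple Boolean functions of the four condition-code bits, which the following sketch evaluates. It rests on assumptions: the complement bars are illegible in the scanned figure and are restored here so that each pair of opposite conditions (gt/le, ge/lt, hi/los, lo/his, and so on) is complementary, and the function name is hypothetical.

```python
def jump_condition(symbol, n, z, v, c):
    """Return True if the named jump condition holds for the flags N, Z, V, C."""
    n, z, v, c = bool(n), bool(z), bool(v), bool(c)
    signed_less = n != v                 # N xor V
    table = {
        "gt": not (signed_less or z),
        "le": signed_less or z,
        "ge": not signed_less,
        "lt": signed_less,
        "hi": not ((not c) or z),        # assumes carry clear means "lower"
        "los": (not c) or z,
        "lo": not c,
        "nc": not c,                     # alternate symbol for code 0111
        "his": c,
        "c": c,                          # alternate symbol for code 1000
        "pl": not n,
        "mi": n,
        "ne": not z,
        "eq": z,
        "nv": not v,
        "v": v,
        "alw": True,
    }
    return table[symbol]
```

For instance, `jump_condition("lt", n=1, z=0, v=0, c=0)` is true, and for any flag values exactly one of `"gt"` and `"le"` holds.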




Figure  A.4.8:  Miscellaneous  Instructions. 


Instr.:

Effect & Notes:

ldhi:

(1) rd := imm19 << 13 (see figure A.2.2(d));

(2) Iff SCC-bit (instr.<24>) is ON, then:

Z := [dest==0] ; N := dest<31> ; V,C := 0.

getlpc:

(1) rd := LSTPC (fig. A.1.1);

(2) Iff SCC-bit (instr.<24>) is ON, then:

Z := [LSTPC==0] ; N := LSTPC<31> ; V,C := garbage.

NOTES: (a) the rs1, shortSOURCE2 fields are discarded;

(b) the value of LSTPC, which is saved in rd,
is equal to the value that the PC had during
the execution of the previous instruction;

(c) this instr. is NOT transparent to interrupts.

getpsw:

(1) rd := (-1)<31:13> concatenated PSW<12:0> ;

Iff SCC-bit is ON, (next-) CC's are set as by ldhi.

ATTENTION!: (a) Previous instr. MUST have its SCC-bit OFF;

(b) IMM-bit MUST be OFF, rs2 MUST be r0 (to
prevent int-forw.). Otherwise rd:=garbage!

NOTES: (a) see fig. A.1.1 for PSW; (b) rs1 is discarded.

putpsw:

(1) PSW := [ rs1 + shortSOURCE2 ]<12:0> .

ATTENTION!: (a) the SCC-bit MUST be OFF;

(b) the following instruction must NOT be:
callx, callr, calli, ret, reti (i.e. NOT
modify CWP), and must NOT set the CC's.

(c) new PSW is NOT in effect during the first
cycle following execution of this instr.

NOTES: (a) see fig. A.1.1 for PSW; (b) rd is discarded.

calli:

(1) CWP := CWP-1 modulo 8 (change window like call).

(2) rd := LSTPC ; CC's possibly affected, like getlpc.

NOTES: (a) LSTPC is saved into rd of the NEW window;

(b) the rs1, shortSOURCE2 fields are discarded;

(c) if interrupts are enabled, an overflow trap
may occur, like for call instructions;

(d) this instr. is intended for use only by the
H/W interrupt mech. (fig. A.5.1), NOT by S/W.

Miscellaneous instructions: These are shown in figure A.4.8. The ldhi
instruction is used for loading the most-significant part of long immediate
constants into registers (fig. A.2.2(d)). The getlpc instruction is used as the first
instruction of every interrupt-handler routine in order to save LSTPC into rd
(see § A.5). The getpsw and putpsw instructions are used for manipulating the
PSW in an arbitrary fashion, including saving it before interrupts nest, and
restoring it at a later time. The calli instruction is used by the hardware interrupt
mechanism. The getlpc and calli instructions read a part of the i-h programmer
visible state, and their effect is therefore not transparent to interrupts. Calli is
intended to be used only by the hardware interrupt mechanism, and getlpc
should be used only as the first instruction of an interrupt handling routine
(§ A.5) or at a place where interrupts are disabled. If these two instructions are
executed in a different context, with interrupts enabled, their result depends on
whether they themselves are interrupted or not. If they are not interrupted,
they will yield the address of the instruction that preceded them. If they are
interrupted, they will yield the address of the last instruction of the
interrupt-handler which serviced the interrupt.
Interrupts and Traps.


Figure A.5.1 describes the automatic hardware actions which occur on
interrupts and traps and the possible causes and respective priorities of
interrupts/traps. The overall effect of an interrupt/trap is that it transfers
control to the interrupt-handler, as if the interrupted instruction had never even
started executing. (The only exception is an external interrupt not due to a
page-fault during a memory-write cycle; see below.) Also, the interrupt/trap
mechanism saves enough information, so that later the normal-user visible state
can be reconstructed and the interrupted instruction can be restarted.
When an interrupt/trap occurs, the normal-user visible state (§ A.1) is
protected from being altered by the executing instruction. The only exception is
that a memory-write access in progress is allowed to complete if it can do so.
This is possible if the interrupt was not due to a page-fault caused by that
access. Furthermore, if execution has proceeded so far that the write access has
been started, no further trap (that is, an anomaly originating from the CPU) can
occur during the execution of that instruction, since all traps occur during the
address calculation cycle for load and store instructions. In other words, the
only possible cause for the abortion of this instruction is an
external non-page-fault interrupt. Therefore, if the memory-write access was
allowed to complete, it must have been a correct access. When the same
instruction is restarted later, after the interrupt-handling routine returns, it will
repeat the same write-access, and the overall result will be the same as if the
instruction had executed only once.

While  most  of  the  normal-user  visible  state  is  protected  from  being  altered 
by  the  executing  instruction,  there  are  some  parts  of  it  that  must  be  altered 
before  the  interrupt-handler  can  start  executing.  Those  parts,  then,  must 
either  be  altered  in  a  known  and  reversible  manner,  or  their  previous  value 
must  be  saved  before  it  is  altered.  When  returning  from  the  interrupt,  the 
alterations  must  be  reversed  and  the  saved  values  restored. 

The state which is altered in a reversible way is the PSW I bit and the CWP.
The I bit must have been ON, since the interrupt/trap occurred, and it is turned
OFF. The CWP is decremented by the hardwired calli instruction. This decrement
will always occur, since the calli executes with interrupts disabled. Both of these
changes are reversed by the reti instruction. Note that the I bit is not necessarily
ON when a reset occurs; however, when a reset is used we don't care to
save the previous state.
Figure A.5.1: Interrupts and Traps in RISC II.


Situation: Interrupt or Trap occurs.

Activities, Effects, Notes:

INTRODUCTORY NOTES:

(1) Interrupts (i.e. external) and Traps (i.e. internal) are
sampled/detected near the "middle" of every cycle;

(2) Instructions "commit", i.e. modify the user-visible state
of the CPU (fig. A.1.1), near the end of their "Execute"
cycle.

AUTOMATIC  (HARDWIRED)  ACTIVITIES  AND  EFFECTS: 

Assume the Interrupt/Trap occurs during cycle#i.

(1) The instruction executing during cycle#i is ABORTED,
i.e. it is NOT allowed to "commit", except that: (i) the
PC's operate independently, and (ii) a memory-write that
may have started will be allowed to complete (if it may).

(2) The instruction that has been fetched during cycle#i (or
i-1) is DISCARDED, and replaced by the (hardwired) instr.:
calli, SCC=OFF, rd=25, rs1=shortSOURCE2=garbage.

(3)  The  PSW  (fig.  A  1.1)  is  modified  as  follows: 

I:=OFF  (disable  interrupts);  P:=S  ;  S:=ON  (system-mode). 

(4)  Instruction-Fetching  starts  at  the  address  specified  by 
the  Interrupt-Vector,  and  NXTPC  is  loaded  with  that  value. 

INTERRUPT CAUSES, AND CORRESPONDING VECTORS:

(1) Reset-pin pulsed high; Vector=80000000 Hexadecimal.

(2) Interrupt-Request-Pin pulsed high; Vector=80000010 Hexad.

TRAP CAUSES, AND CORRESPONDING VECTORS:

(1) Illegal opcode (fig. A.4.2) executed; Vect=80000000 Hexad.

(2) Privileged opcode (fig. A.4.2) executed while S==OFF (i.e.
in user-mode); Vector=80000000 Hexadecimal.

(3) Address-Misalignment (fig. A.4.4-5); Vect=80000000 Hexad.

(4) Reg-File Overflow (fig. A.4.6(call)); Vect=80000020 Hexad.

(5) Reg-File Underflow (fig. A.4.6(ret)); Vect=80000030 Hexad.

INTERRUPT/TRAP DISABLING:

All  interrupts  and  traps,  except  the  one  caused  by  the  Reset- 
pin,  are  disabled  whenever  the  I  bit  of  PSW  is  OFF. 

PRIORITIES: 

In case more than one interrupt/trap cause is present at
once, the Vector is determined according to the priority:
80000000 has highest priority, 80000020 and 80000030 have
medium priority (they cannot occur simultaneously), and
80000010 has lowest priority.

NOTES: 

(1) In cycle#(i+1), the hardwired calli instruction will execute,
changing the window, and saving LSTPC into reg. 25 of the
new window. The value saved is equal to the PC of cycle#i.

(2) In cycle#(i+1), the instr. @ Vector will be fetched, and
it will execute in cycle#(i+2). That instruction must be a
getlpc, to save LSTPC of cycle#(i+2) = NXTPC of cycle#i.
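The disabling and priority rules of Figure A.5.1 amount to a small selection function, sketched below. The cause names and the function name are hypothetical labels chosen for this illustration; only the vector addresses, the priority order, and the role of the I bit come from the figure.

```python
# Vector address for each interrupt/trap cause listed in Figure A.5.1.
CAUSE_VECTORS = {
    "reset": 0x80000000,
    "illegal_opcode": 0x80000000,
    "privileged_opcode": 0x80000000,
    "address_misalignment": 0x80000000,
    "regfile_overflow": 0x80000020,
    "regfile_underflow": 0x80000030,
    "interrupt_request": 0x80000010,
}

# Highest priority first, as stated under PRIORITIES.
VECTOR_PRIORITY = [0x80000000, 0x80000020, 0x80000030, 0x80000010]


def select_vector(causes, psw_i_on):
    """Pick the vector for the pending causes, or None if all are disabled."""
    # With the I bit OFF, only the Reset pin can still interrupt.
    enabled = [c for c in causes if psw_i_on or c == "reset"]
    pending = {CAUSE_VECTORS[c] for c in enabled}
    for vector in VECTOR_PRIORITY:
        if vector in pending:
            return vector
    return None
```

With interrupts enabled, a simultaneous register-file overflow and external interrupt request resolve to 80000020; with the I bit OFF, the same pair yields no interrupt at all.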
The part of the state that is saved before it is altered consists of the PSW S
bit,  the  NXTPC,  and  the  PC.  The  S  bit  is  saved  into  the  P  bit  by  automatic 
hardwired  action,  and  it  is  restored  by  the  reti  instruction.  The  PC  is  saved  into 
LSTPC,  which  is  subsequently  saved  into  register  25  of  the  new  window  by  the 
calli  instruction.  The  NXTPC  goes  into  PC,  which  then  goes  into  LSTPC,  and 
which  must  be  saved  by  the  first  instruction  of  the  interrupt-handler.  That 
instruction  must  be  a  getlpc,  and  it  will  place  this  value  typically  into  register 
24.  After  the  completion  of  the  interrupt-handling,  these  two  values  must  be 
restored  in  the  following  fashion: 

jmpx  (alw),  r25  +  0  ; 
reti  (alw),  r24  +  0  . 
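The way the two saved addresses arise can be traced with a small model of the NXTPC, PC, LSTPC chain. This is a sketch under assumptions: the function names are hypothetical, the chain is assumed to shift once per cycle, and the 4-byte instruction size only affects the handler's own fetch address, not the saved values.

```python
def interrupt_entry(pc, nxtpc, vector, instr_size=4):
    """Trace the PC chain from cycle i (interrupt) to cycle i+2.

    pc, nxtpc: PC and NXTPC during cycle i. Returns the saved registers.
    """
    regs = {}

    # Cycle i+1: the chain shifts and NXTPC is loaded with the vector.
    lstpc, pc, nxtpc = pc, nxtpc, vector
    regs[25] = lstpc            # hardwired calli (rd=25) saves PC of cycle i

    # Cycle i+2: the chain shifts again; the handler's getlpc executes.
    lstpc, pc, nxtpc = pc, nxtpc, nxtpc + instr_size
    regs[24] = lstpc            # getlpc r24 saves NXTPC of cycle i
    return regs


def resume_order(regs):
    """Addresses re-executed by 'jmpx r25+0 ; reti r24+0', in order."""
    # jmpx is a delayed jump, so reti sits in its delay slot: the aborted
    # instruction (r25) runs first, then the one that followed it (r24).
    return [regs[25], regs[24]]
```

Keeping both addresses matters when the aborted instruction sat in the delay slot of a jump, because its successor is then not the next sequential address.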

Notice  that  the  detection  of  register-window  overflows  works  in  such  a  way 
that  registers  25  through  16  (the  locals)  of  the  window  just  below  the  current 
one,  are  guaranteed  to  be  free  whenever  interrupts  are  enabled,  provided  this 
condition  held  true  when  interrupts  were  originally  enabled  (§  3.2.2).  The  calli 
and getlpc instructions save LSTPC into two of those local registers on interrupts.
The interrupt-handler may also use those local registers (and only those)
as  scratch  memory. 

While the normal-user visible state is saved on an interrupt/trap, the i-h
programmer visible state (§ A.1) is not saved. This means that interrupts/traps
must NOT nest without prior appropriate arrangements. Thus, before the
interrupt-handler re-enables interrupts, or modifies the CC's, or uses any
registers other than 25 through 16, or calls a subroutine, it must save the PSW
somewhere (e.g. getpsw r23, r0, r0), and make sure that there are more free windows
below the current one. As the last step before returning from the
interrupt-handler with the instruction-pair jmpx-reti, the PSW must be restored (e.g.
putpsw r0, r23, r0). This restores the environment and the state that existed
before this interrupt occurred.
PROCESSOR DESIGN TRADEOFFS IN VLSI


Robert  Warren  Sherburne,  Jr. 

ABSTRACT 

As the density of circuit integration is increased, management of complexity
becomes a critical issue in chip design. Hundreds of man-years of design time
are required for complex processors which are presently available on a few
chips. This high cost of manpower and other resources is not acceptable. In
order to address this problem, the Reduced Instruction Set Computer (RISC)
architecture relies on a small set of simple instructions which execute in a
regular manner. This allows a powerful processor to be implemented on a single chip
at a cost of only a few man-years. A critical factor behind the success of the
RISC II microprocessor is the careful optimization which was performed during
its design. Allocation of the limited chip area and power resources must be
carefully performed to ensure that all processor instructions operate at the
fastest possible speed. A fast implementation alone, however, is not sufficient;
the designer must also consider overall performance for typical applications in
order to ensure best results. Areas of processor design which are analyzed in
this work include System Pipelining, Local Memory Tradeoffs, Datapath Timing,
and ALU Design Tradeoffs. Pipelining improves performance by increasing the
utilization of the datapath resources. This gain is diminished, however, by data
and instruction dependencies which require extra cycles of delay during
instruction execution. Also, the larger register file bitcells which are needed in order
to support concurrency in the datapath incur greater delays and reduce system
bandwidth from the expected value. Increased local memory (or register file)
capacity significantly reduces data I/O traffic by keeping frequently needed data
in registers on the chip. Too much local memory, though, can actually reduce
system throughput by increasing the datapath cycle time. Various ALU
organizations are available to the designer; here several approaches are investigated
as to their suitability for VLSI. Carry delay as well as power, area, and regularity
issues are examined for ripple, carry-select, and parallel adder designs. First, a
traditional, fixed-gate delay analysis of carry computation is performed over a
range of adder sizes. Next, delays are measured for NMOS implementations
utilizing dynamic logic and bootstrapping techniques. The results differ widely: the
fixed-delay model shows the parallel design to be superior for adders of 16 bits
and up, while the NMOS analysis shows it to be outperformed by the
carry-select design through 128 bits. Such a result underscores the need to
reevaluate design strategies which were traditionally chosen for TTL-based
implementations. Single-chip VLSI implementations impose a whole new set of constraints.
It is hoped that this work will bring out the significance of evaluating the design
tradeoffs over the whole spectrum ranging from the selection of a processor
architecture down to the choice of the carry circuitry in the ALU.

In this research I was supported for three years by a General Electric
doctoral fellowship. The RISC project was supported in part by ARPA Order No. 3803
and monitored by NESC #N00039-78-C-0013-0004.
CHAPTER 1:

INTRODUCTION

In the world of integrated circuits a revolution is taking place. Silicon
chips, which only a decade ago contained several transistors or logic gates, now
accommodate up to hundreds of thousands of transistors. Several 32-bit Central
Processing Unit (CPU) implementations, as well as 64 and 256 Kilobit dynamic
Random Access Memories (RAMs), have been produced on a single chip. Higher
levels of integration offer systems which are not only smaller, cheaper, and less
costly to operate than their predecessors; they offer higher performance as
well. By shrinking the circuitry so that it can reside on less than a square
centimeter of silicon area, wire delays are reduced dramatically. As a result, the
user obtains higher performance at lower cost.

The rising complexity confronting the designer is a serious concern as
device capacity on a chip increases. The design of a typical, 32-bit microprocessor
requires 30 or more man-years. If this were performed by a single person, the
fabrication technology would have changed so much by the time the design was
finished that the original assumptions made regarding chip constraints would be
grossly invalid. In order to shorten the elapsed design time, chips are
partitioned into modules, each of which is constructed by a design team. Each team
optimizes its module while conforming to the specifications assigned by the
project leader or manager. This divide-and-conquer approach is the traditional
methodology in industry, where early product release yields a high return on
the initial investment.

A disadvantage of the divide-and-conquer strategy is that no design team is
familiar with the chip as a whole. This makes global optimization difficult, if not
impossible. The overall organization of the system and its microarchitecture
sets a fundamental limit on performance. A poor choice of microarchitecture
will render a design doomed to a short life, if not outright failure, in the
marketplace. The microarchitect's responsibility is to address this issue. He must be
familiar with the architecture as well as the constraints of the fabrication
technology. Since the chip constraints are constantly changing, design decisions
must be periodically reevaluated.

Traditional CPUs, such as those in the IBM 360/370 and DEC VAX-11/780,
consist of several circuit boards filled with standard Small and Medium-Scale
Integration (SSI and MSI) packages. Performance is limited by the logic delays
and wiring delays associated with the many inter-chip communication paths.
Bipolar chips are normally used in such designs because they offer high
transconductance. This means that a greater output current drive is available,
which is important for reducing wire delays. Increasing the speed of signal
propagation between chips requires more power per chip. Expensive cooling
systems must then be added in order to control chip temperature.

In contrast, more recent 32-bit CPU designs have been implemented by
Very Large-Scale Integration (VLSI) on a single chip. As a larger fraction of the
system is placed on a chip, interchip wiring delays are reduced. Interchip
delays are replaced with smaller, on-chip wire delays, increasing system
performance. A ceiling for maximum transistor count is set by the limited chip area
and power dissipation of the technology. Because Metal-Oxide Semiconductor
(MOS) transistors are smaller and consume less power than their bipolar
counterparts, they are more attractive for VLSI. The poor transconductance of the
MOS transistor increases off-chip signal delay, but this is offset by the reduced
number of such delays inherent in higher levels of integration. MOS is presently
the most popular fabrication technology for VLSI systems.

A VLSI processor is expensive to develop. Instead of relying on available
SSI/MSI parts, the designer must perform circuit design, layout, and simulation
at the device level. Optimization of one module on the chip affects the area,
power, and timing available for the other modules. Wiring is costly in terms of
chip resources: a 32-bit bus occupies a large amount of area in the planar
layout. The number of Input/Output (I/O) pads is limited by chip periphery. This
complicates testing by restricting access to internal state. Redesign is costly
because new masks and wafers are required, delaying the product several
months. As a consequence, the board-level design is more attractive for
implementing complicated processors.

The high costs associated with a single-chip design are mainly due to
complexity in design and testing. These penalties of single-chip implementation can
be alleviated by simplifying the CPU design. By reducing the complexity and
internal state of the machine, it becomes simpler to design and test, and more
likely to be free of design errors on the first try.

Several NMOS, single-chip VLSI microprocessors have been designed with
this idea in mind. The RISC I [1], RISC II [2], and MIPS [3] implementations
utilize low-level instructions, each of which requires a single machine cycle to
execute. This regular execution timing simplifies the application of pipelining for
high performance while remaining conceptually simple. Less instruction
decoding is necessary, allowing small, fast Programmed Logic Arrays (PLAs) or simple
decoders to be used.


This contrasts with commercial single-chip microprocessors, which
dedicate over half the die area to microcode Read-Only Memory (ROM) for
instruction decoding. In the Reduced Instruction Set Computer (RISC) design, area
freed up by the reduced control circuitry is used for an expanded register file.
Depending on the application environment, other functions may instead be
incorporated on chip in the freed-up area to improve system performance.

In a single-chip design, datapath speed is a limiting factor in system
performance. The datapath consists of the functional modules which manipulate data
and provide for its temporary storage. Additionally, as its name implies, it
includes the paths over which data may flow between these modules. Datapath
cycle time is determined by delays in the main datapath modules and the
communication overhead incurred to and from these modules during the machine
cycle.

In order to minimize the datapath cycle time, the microarchitect
determines what datapath modules are necessary and how they are orchestrated
during the cycle. First, a functionally partitioned, two-dimensional representation
of data flow is composed. Timing schemes are formulated which maximize
concurrency of data operations and data flow. Next, the modules are optimally
placed on the chip for minimum communication path delay and area. Several
iterations may be required in order to stay within the limits of the chip
resources.

This thesis studies fundamental design tradeoffs of importance to the
microarchitect of a VLSI processor. All of these tradeoffs are interrelated; the
microarchitect must decide which tradeoffs to make for best performance
under specified conditions. These conditions include the fabrication technology
and amount of chip resources available, as well as the type of environment for
which the system is designed.
Pipelining at the system level is investigated in Chapter 2. It presents the
timing of the basic operations to be performed using increasing levels of
concurrency. It is when overall datapath timing and resource allocation are
determined that optimal topology of information flow must be considered.

Within  the  datapath  itself,  the  local  memory  (register  file)  and  ALU  delays 
limit  the  speed  which  may  be  attained.  A  larger  local  memory  effectively 
reduces  data  traffic  arising  from  procedure  calls  and  returns.  Datapath 
bandwidth,  however,  is  reduced  due  to  the  increased  register  cycle  time.  A 
conflict  then  exists  between  the  desire  for  maximum  datapath  bandwidth,  and 
the  need  to  reduce  data  I/O  overhead. 

Since  the  local  memory  may  occupy  a  significant  portion  of  the  chip,  it  is 
important to ensure that it makes effective use of available resources.
Programming environments which include many nested procedures benefit
significantly  from  a  multiple-bank  local  memory  scheme.  On  the  other  hand, 
those  with  few  procedures  may  suffer  from  the  increased  register  cycle  time.  In 
addition  to  the  programming  environment,  the  register-bank  swapping  strategy, 
overflow  interrupt  overhead,  and  data  I/O  bandwidth  affect  optimal  memory 
size.  These  tradeoffs  are  investigated  in  Chapter  3. 

Datapath  bandwidth  may  be  improved  by  pipelining  the  read  and  write 
operations  of  the  register  file.  Different  bit  cells  are  required  for  different  levels 
of  pipelining.  In  general,  the  number  of  wordlines  and  bitlines  in  the  cell  must 
increase with higher concurrency. Register cell design issues relating to
datapath timing are investigated in Chapter 4.

ALU delay can also be improved with increased parallelism. Adder delay
analysis has traditionally been performed using the notion of fixed gate delay.
This is appropriate for TTL implementations where performance is dominated by
on-chip buffer delay. Such an approach is not suitable for VLSI, where
performance becomes limited by wiring and transistor parasitics. In Chapter 5,
relative performance of ripple, partial lookahead, conditional carry, and parallel
adders is analyzed and compared for both the constant gate delay model and an
NMOS model which takes into account device parasitics and permits evaluation
of alternative circuit design strategies for increased performance.

Interactions between these areas of design tradeoffs are investigated in
Chapter 6. Designing for limited chip area and power resources is also
discussed. Conclusions drawn from these areas of analysis are summarized in
Chapter 7.
SYSTEM PIPELINING


The  goal  of  pipelining  is  to  make  more  effective  use  of  system  resources, 
and  in  so  doing,  increase  performance.  Greater  levels  of  pipelining  lead  to 
increased  concurrency.  The  execution  of  one  instruction  (or  microinstruction) 
may  overlap  initiation  and  completion  of  several  others.  This  will  be  illustrated 
for  datapath  timing  in  a  later  chapter.  In  this  chapter  we  will  take  into  account 
the  interaction  with  external  (off-chip)  memory. 

Performance  Improvement  by  Pipelining 


Phase 1                  Phase 2            Phase 3

Instruction Memory:      Execution:         Data Memory:
Instruction Address      Operand Fetch      Data Store
Instruction Fetch        R-R Execution      Data Fetch
Instruction Decode       Write Result       Alignment


Figure 1: The Three Phases in Instruction Processing


There are several sequential steps involved in the execution of a single
memory-to-memory instruction. The first phase of this cycle consists of the
instruction fetch and decode. The next phase encompasses the
register-to-register operation within the CPU, where data modifications are performed. For
data I/O (Input/Output) instructions, this phase consists of address calculations
for LOADs, STOREs, or JUMPs. Finally, there is a third phase for LOAD and STORE
instructions during which they access data memory. Support for sub-word data
(e.g. bytes or half-words) may be included here. These three phases, each
subdivided into three subphases, utilize different resource groups (Figure 1).


Figure  2:  Timing  of  Sequential  Execution 

Timing  for  simple,  serial  execution  is  illustrated  in  Figure  2.  Shown  are  five 
instructions,  three  of  which  access  data  memory.  This  approach  makes  poor 
usage  of  the  resources  on  chip.  For  example,  the  ALU  is  used  during  less  than 
half  of  the  phases  (one  third  for  data  I/O  instructions).  The  I/O  bus  is  not  used 
at  all  during  the  execution  phase. 
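The utilization figures quoted for serial execution can be checked with a short calculation. This is a sketch under the assumption that every instruction spends one phase on fetch and one on execution, that LOAD and STORE add a third phase for data memory, and that the ALU is busy only during the execution phase; the function names are hypothetical.

```python
from fractions import Fraction


def sequential_phases(accesses_data_memory):
    """Total phases for serial execution.

    accesses_data_memory: one boolean per instruction, True for LOAD/STORE.
    """
    return sum(3 if mem else 2 for mem in accesses_data_memory)


def alu_utilization(accesses_data_memory):
    """Fraction of phases in which the ALU does useful work."""
    return Fraction(len(accesses_data_memory),
                    sequential_phases(accesses_data_memory))


# Five instructions, three of which access data memory (two LOADs, a STORE).
program = [True, True, False, True, False]
total = sequential_phases(program)     # 3 + 3 + 2 + 3 + 2 phases
usage = alu_utilization(program)       # 5 ALU phases out of the total
```

A program made up only of data I/O instructions gives a utilization of exactly one third, and one with none gives one half, which brackets the mixed case.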

Even very simple pipelining can increase performance substantially. A
2-way pipelined scheme is shown in Figure 3, in which instruction fetching
overlaps execution of the previous instruction. This yields up to twice the bandwidth
LOAD A←M
LOAD B←M
ADD C←A+B
STORE M←C
BRANCH X
NOOP


Figure  3:  Two-Way  Pipelined  Timing  Showing  Branch  Delay 

of the serial scheme. It is assumed that only a single I/O operation is permitted
in each phase; thus a wait-state is inserted while data I/O takes place. Hence,
performance is I/O limited: a single memory access occurs in each phase. This
is the timing scheme of the RISC I and RISC II microprocessors [1].

In  the  event  of  a  program  branch,  the  branch  address  calculation  is  not 
completed  until  the  following  instruction  has  been  fetched.  This  results  in  a 
delay  of  one  phase  before  the  target  address  is  ready.  In  order  to  accommodate 
this  delay,  a  NOOP  (NO  OPeration)  instruction  is  inserted  after  the  branch 
instruction.  This  creates  overhead  for  all  program  branches.  This  overhead 
may  be  reduced  by  redefining  the  branch  instruction  in  such  a  way  that  it  is 
supposed  to  take  effect  only  after  the  subsequent  instruction.  The  code  can 
then  be  reordered  so  that  the  instruction  after  the  branch  performs  useful  work 
prior  to  the  branch  occurrence  [2].  Otherwise,  a  wait-state  (in  the  form  of  a 
NOOP)  is  required.  Code  reorganization  for  this  "delayed  branch"  optimization 
will  be  discussed  in  more  detail  later  in  this  chapter. 
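
The delayed-branch idea described above can be illustrated with a minimal
sketch. It assumes a hypothetical instruction format (opcode, destination,
sources), a single delay slot per branch, and that the instruction being moved
is not itself a branch target; none of these details come from the report.

```python
# Hypothetical instruction format: (opcode, destination, sources).
NOOP = ("NOOP", None, ())

def fill_branch_slots(program):
    """Give every BRANCH one delay slot, filled with the preceding
    instruction when the branch does not read that instruction's result;
    otherwise the slot is padded with a NOOP."""
    out = []
    movable = False  # True when out[-1] may legally move into a delay slot
    for instr in program:
        op, dest, srcs = instr
        if op != "BRANCH":
            out.append(instr)
            movable = True
        elif movable and out[-1][1] not in srcs:
            prev = out.pop()
            out += [instr, prev]   # prev now executes in the delay slot
            movable = False
        else:
            out += [instr, NOOP]   # nothing useful found: wait state
            movable = False
    return out

program = [
    ("LOAD", "A", ("M",)),
    ("LOAD", "B", ("M",)),
    ("ADD", "C", ("A", "B")),
    ("STORE", "M", ("C",)),
    ("BRANCH", None, ()),          # unconditional branch to X
]
reordered = [op for op, _, _ in fill_branch_slots(program)]
print(reordered)
```

For the five-instruction example of Figure 3, the STORE moves into the slot
after the branch, so no NOOP is needed. A conditional branch that tests the
result of the preceding instruction would still receive a NOOP.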


LOAD A<-M
LOAD B<-M
NOOP
ADD C<-A+B
STORE M<-C
BRANCH X
NOOP

Figure 4: Three-Way Pipelined Timing Showing Data Delay

The  pipelining  can  be  further  improved  by  permitting  two  memory  I/O 
operations  per  phase,  as  shown  in  Figure  4.  Multiple  I/O  operations  per  phase 
may  be  attained  either  by  multiplexing  a  single  port  or  by  replicating  ports.  In 
this  pipelining  scheme  a  new  instruction  is  initiated  each  phase,  and  as  many  as 
three  instructions  may  overlap.  Unoptimized  branch  delay  remains  at  one 
phase,  as  in  the  previous  scheme.  However,  the  overlapped  data  memory  access 
may  now  require  wait  states  following  LOAD  instructions.  As  indicated  in  the 
figure,  the  data  fetch  is  not  completed  prior  to  the  execution  of  the  following 
instruction.  If  this  following  instruction  utilizes  the  fetched  data  as  one  of  its 
operands,  it  must  wait  one  phase.  Code  reorganization  for  data  dependency 
optimization is similar to that for delayed branches (to be discussed). At best,
this approach is three times faster than the sequential approach.
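
The LOAD wait state described above can be sketched as a simple pass over the
instruction stream. This is only an illustration under assumed conventions: the
tuple format (opcode, destination, sources) is hypothetical, and a single
phase of delay after each LOAD is assumed, as in the three-way scheme.

```python
# Hypothetical instruction format: (opcode, destination, sources).
def insert_load_delays(program):
    """Insert a NOOP after any LOAD whose result is read by the
    immediately following instruction."""
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        if op == "LOAD" and i + 1 < len(program):
            next_srcs = program[i + 1][2]
            if dest in next_srcs:
                out.append(("NOOP", None, ()))  # one-phase wait state
    return out

program = [
    ("LOAD", "A", ("M",)),
    ("LOAD", "B", ("M",)),
    ("ADD", "C", ("A", "B")),
    ("STORE", "M", ("C",)),
]
padded = [op for op, _, _ in insert_load_delays(program)]
print(padded)
```

Only the second LOAD is followed by a wait state, since the ADD needs B
immediately; the first LOAD has a full phase of slack. An optimizer would try
to place an independent instruction in that position instead of the NOOP.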

A four-way pipelined timing scheme is detailed in Figure 5. The execution
phase is divided into two sub-phases: E1 (register file read, with internal
forwarding and overlapped write) and E2 (ALU or shifter operation). The ALU
and register file are used in every phase; this allows maximal resource usage.
Overhead due to unoptimized data dependencies remains a single phase if the
ALU result is directly forwarded to the concurrent read phase of the following
instruction. Unoptimized branch overhead, however, is now two phases per
branch. Two memory accesses per phase must be supported. This requires two
I/O busses, as in the TMS 320 [3] and the MIPS [4] microprocessors.
Additionally, the memory I/O cycle is now shortened to match half the execution
time, so faster memory is necessary in order to realize the full gain of this
added level of pipelining.

Figure 5: Four-Way Pipelined Timing

The processor-limited execution time for a given program using the various
pipelining schemes discussed may be given as:

I.   Sequential        2N + D

II.  2-Way Pipelined   N + D + (J)

III. 3-Way Pipelined   N + (D) + (J)

IV.  4-Way Pipelined   (1/a) [ N + (D) + (2J) ]

where N represents the number of coded instructions, with D data accesses and
J jumps. Time is normalized with respect to a single register cycle. Optimizable
overhead (branching, data dependencies) is shown in parentheses. The factor a
represents the memory cycle speedup available over the previous schemes.
Ideally, a = 2 for the 4-way scheme, which gives the maximum performance
improvement.
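
As a rough check on these expressions, the following sketch evaluates them for
a sample instruction mix. It reads the leading factor of expression IV as 1/a,
which is an assumption, and treats the parenthesized terms as an upper bound
that disappears entirely under perfect optimization. The function name and the
sample numbers are illustrative only.

```python
def execution_time(scheme, n, d, j, a=2.0, optimized=False):
    """Processor-limited execution time in register cycles.

    n: coded instructions, d: data accesses, j: jumps,
    a: memory cycle speedup (used by the 4-way scheme only).
    optimized=True drops the parenthesized (optimizable) terms.
    """
    opt = 0 if optimized else 1
    if scheme == "sequential":
        return 2 * n + d
    if scheme == "2-way":
        return n + d + opt * j
    if scheme == "3-way":
        return n + opt * d + opt * j
    if scheme == "4-way":
        return (n + opt * d + opt * 2 * j) / a
    raise ValueError(f"unknown scheme: {scheme}")

# Illustrative mix: 100 instructions, 30 data accesses, 10 jumps.
for scheme in ("sequential", "2-way", "3-way", "4-way"):
    worst = execution_time(scheme, 100, 30, 10)
    best = execution_time(scheme, 100, 30, 10, optimized=True)
    print(scheme, worst, best)
```

With full optimization and a = 2, the four-way scheme needs N/2 cycles against
2N + D for sequential execution, which is where the ideal speedup proportional
to pipeline depth comes from.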

A  comparison  of  ideal  performance  (fully  optimizable)  for  these  approaches 
is  shown  in  Figure  6.  Results  are  normalized  with  respect  to  the  non-pipelined 
(sequential)  case;  the  four-way  scheme  assumes  a=2.  Bandwidth  reduction 
caused  by  added  data  I/O  cycles  is  shown  in  the  shaded  areas  on  the  graph:  the 
full  shaded  penalty  occurs  for  the  case  of  all  instructions  performing  data 
LOADs.  A  fraction  of  this  area  would  then  pertain  to  the  actual  data  I/O  cost  for 
a  particular  program. 

Figure 6: Performance Comparison of Pipelined Schemes (for a=2)
(shaded  area  indicates  maximum  possible  data  I/O  overhead) 

Optimization  of  Pipeline  Dependencies 

Ideal performance is proportional to the number of pipelining levels
employed in the system. However, overhead due to data and instruction
dependency reduces the efficiency of pipelining. This overhead is most
significant for highly-pipelined machines. Code reorganization at the
register-to-register instruction level may reduce this penalty inherent in
pipelined implementations. The effective use of on-chip local memory also
reduces data dependencies, by reducing data I/O traffic. Design tradeoffs
concerning local memory will be investigated in a later chapter.

Code is optimized by reordering so that the required data or instruction is
available when needed. The optimized jump or load (in the event of a branch
slot or data dependency optimization, respectively) is performed earlier than is
needed in a sequentially executing program. Useful work may be done in the
interim if the optimizing compiler can find an instruction to put into the empty
slot. This is the goal of the delayed jump scheme [2]. This reordering is not
easily done for conditional jumps, for which branch prediction techniques must
be utilized. Data from the VAX-11/780 indicate that conditional branches
constitute 7% to 17% of the dynamic instruction count [5]. This penalty may be
considered acceptable unless the number of pipelining levels is high. Stack
machines are less flexible in terms of code reorganization; for this reason only
register-based machines will be considered.

Conditional and unconditional branches, whether absolute or relative,
constitute less than 25% of typical programs. Subsequent to optimization,
unfilled branch slots for the MIPS processor vary from 3% to 24% (for a single
slot per branch), and 21% to 50% (for two slots per branch) [6]. Results of this
optimization vary depending on the programming environment and compiler
technology. The IBM 801 compiler "...is able, generally, to convert about 60% of
the branches in a program into the execute form." [7]. Figure 7(a) illustrates
the effect of unoptimized branches on overall performance. In (b) this is
compared among the four timing schemes. Estimated results of optimization are
presented in (c), assuming the worst case MIPS unfilled branch slot incidence
mentioned above (24% and 50%). As expected, the overhead of unfilled branch
slots increases with pipelining. This overhead need not be directly proportional
to the number of branches; the graph indicates an upper bound for convenience
of discussion.
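
The upper-bound view of branch overhead can be put in numbers with a small
sketch. It assumes that the quoted percentages are the fraction of delay slots
left unfilled and that each unfilled slot costs one phase; both are
interpretations, not statements from the report, and the function name is
illustrative.

```python
def branch_slot_overhead(branch_fraction, slots_per_branch, unfilled_fraction):
    """Extra phases per executed instruction spent in unfilled branch slots."""
    return branch_fraction * slots_per_branch * unfilled_fraction

# Worst-case figures quoted above, with 25% of instructions being branches.
one_slot = branch_slot_overhead(0.25, 1, 0.24)   # single slot per branch
two_slots = branch_slot_overhead(0.25, 2, 0.50)  # two slots per branch
print(round(one_slot, 3), round(two_slots, 3))
```

Under these assumptions the single-slot schemes lose about 6% of their phases
to unfilled slots, while the two-slot (four-way) scheme loses about 25%, which
is consistent with the observation that the overhead grows with pipeline depth.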

Dependencies arising from LOADs and branches are similar, with the
exception that the former refers to data memory, and the latter refers to
instruction memory. Such dependencies are inherently reduced with a
register-based architecture. This is because the frequency of LOADs is reduced
by depending mainly on local register storage of operands. It is expected that
optimized data dependency overhead will be less than that for branches, since
the dynamic LOAD count is typically less than 15%, which is observed for
Quicksort [2]. Performance overhead of data dependencies may be determined
with the aid of Figure 7.

(b) Comparison among pipeline schemes
(c) Comparison after optimization
(axis label: branch incidence)

Figure 7: Dependency Overhead and Optimization

Execution time overhead due to dependencies is also accompanied by a
corresponding increase in code size due to the NOOP instructions. Dependency
optimization significantly reduces this overhead by replacing NOOPs with useful
instructions. Elimination of remaining NOOPs may be accomplished by encoding
wait states in the instructions responsible for the dependencies.

LOAD A<-M
LOAD B<-M
BRANCH X
ADD C<-A+B
STORE M<-C

Figure 8: Reorganized Code for Four-Way Pipeline

After code reorganization for dependency optimization, the four-way
pipelined instruction sequence of Figure 5 appears as shown in Figure 8. This
example includes a performance improvement of 80% as well as a 37% reduction
in code size.

This code optimization at the level of the machine cycle is important for
highly pipelined machines. Complex instructions cannot be reordered at this
level, and as a result perform poorly. Discussion of this issue for the MULTIPLY
instruction is given in [8].

Pipelining  Datapath  Modules 

Pipelining may also be applied at the submodule level within the datapath.
For example, the ALU and register file may each exploit several levels of
concurrency. Dependencies have been investigated as one side-effect which limits
performance of a pipelined system. Additional costs of pipelining are the
additional storage elements and associated clocks needed to hold intermediate
results. A high degree of pipelining entails more circuitry for these functions.
A consequence may be extra delay in the throughput of all instructions. This
delay results from the propagation time through the added storage elements
and data skewing, as well as the required clock setup time for latching the
intermediate results. This overhead can be significant in a highly-pipelined
module, such as the parallel adder example in [9]. Datapath pipelining will be
considered in more detail in Chapter 4. Typically, pipelining in a datapath
module is employed only to the degree that it helps to alleviate a severe
bottleneck in system performance. The degree of attainable pipelining is then
determined by the slowest link in the system which cannot be improved; often
this is the I/O cycle. The following chapter will address I/O-limited
performance issues.

CHAPTER 3:


LOCAL  MEMORY  TRADEOFFS 


A fundamental limitation to processor performance is set by the ratio of the
amount of memory traffic to the available I/O bandwidth. The bandwidth limit
for a given technology is set by area and power constraints. Only a limited
number of I/O pads with their associated driver circuits can be placed on the
chip periphery. Wire bonding technology has not followed in the footsteps of the
shrinking transistor; pad size has remained constant over the years. Power
dissipation of I/O drivers is determined by a delay-power product, because the
off-chip loading is primarily capacitive. Multiplexing the pads for several I/O
transactions per cycle requires a faster settling time, and hence greater power
dissipation.

Memory traffic consists of two classes of information: instructions and
data. Several options are available for reducing either component. At a high
level, the set of machine instructions may be designed to include powerful
constructs which are equivalent to many simple instructions. This has been done
traditionally for large mainframe computers. There is much debate, however,
with regard to its use in VLSI implementation. Advocates of the simpler Reduced
Instruction Set Computer (RISC) maintain that, within the constraints of
single-chip implementation, a complex instruction set is a poor use of limited
chip resources [1]. Microprocessors to date have devoted the majority of their
die area to instruction microcode ROM. RISC implementations utilize this area to
provide more local memory. Use of local memory keeps much of the needed
data local to the processor and allows data traffic to be reduced. A
register-based machine, for example, can store frequently-used operands in a
fast, multiple-port register file.

Register allocation is performed by the compiler and requires no hardware
overhead. It is performed independently for each subroutine, thus procedure
calls require separate register banks or blocks of local memory. Register
contents may have to be swapped out of local memory in order to make room for
the next procedure. The I/O overhead entailed is costly and may actually
increase data traffic over that required by off-chip operand storage. In order to
overcome this performance degradation, the RISC I microprocessor organizes
its local memory as multiple register banks, with each bank supporting a
different procedure level [1]. In addition, adjacent banks overlap partially in
order to facilitate parameter passing among subroutines. This approach
drastically reduces data I/O traffic.

A multiprogramming or multitasking environment puts even more stringent
demands on the performance of local memory. During each context switch a
new register bank must be made available for the next executing program. In
the case of a single register bank, its contents must be saved during every
context switch. Multiple banks reduce this overhead by allowing context informa-


switching support must accommodate register overflows. When the number of
banks is exceeded, some swapping to external memory is required in order to
make room on chip. When the capacity of a given bank is exceeded, external
memory storage must be used to store additional operands.

The alternative to the approach discussed above is a pure memory-to-memory
architecture. With this scheme all data are stored in external memory,
and no saving or restoring of registers is necessary upon a procedure call or
context switch. On the other hand, all operand manipulations require data load
and store operations to be performed. The above-mentioned data I/O
bottleneck, and the resulting greater latency compared to that of the register
file architecture, may reduce performance. An example of a memory-to-memory
microprocessor is the TMS 9995 [4].

The relative merit of the memory-to-memory approach versus a
register-based machine depends on the programming environment and memory
performance. Additional data traffic of the memory architecture must be compared
to that incurred in a register machine when the procedure nesting depth or
number of processes exceeds the available number of register banks, as well as
that occurring when the capacity of each bank is exceeded.

Clearly,  an  infinitely  large  local  memory  is  desirable  since  it  can  reduce 
off-chip  data  traffic  to  zero.  Thus,  the  architects  would  like  to  have  as  much  on- 
chip  memory  as  possible.  However,  if  the  local  memory  is  too  large,  it  will  also 
be  slower,  and  system  performance  will  be  degraded.  This  chapter  analyzes 
these  tradeoffs. 

Local  Memory  in  RISC  II 

The RISC II is the second in a series of 32-bit, NMOS microprocessors
developed at U.C. Berkeley [5]. The RISC instruction set consists solely of single
register-to-register operations [6]. This simple and regular implementation
reduces control complexity, chip area, and design time, and it simplifies
implementation of pipelined execution [?]. The simple RISC instruction set is an
easier target for highly optimizing compilers than is a complex instruction set
[9]. Proper optimization can also reduce dependency overhead inherent in
pipelined implementations [9], thus making more effective use of the available
datapath bandwidth.

A drawback of such an instruction set is that it requires higher memory
bandwidth  for  fetching  these  instructions.  Because  the  instructions  are  simpler, 
it  often  requires  several  of  them  to  synthesize  a  complex  instruction.  This 
increases  overall  code  size.  On  the  other  hand,  the  RISC  microarchitecture 
includes  support  for  subroutine  call  and  return,  one  of  the  most  time-consuming 
operations  in  typical  high-level  language  programs  for  machines  which  keep 
variables  in  registers  [8].  The  number  of  register  saves  and  restores  is  reduced 
by  employing  a  local  memory  organized  as  multiple  register  banks.  A  new  bank 
is  allocated  whenever  a  procedure  is  called.  The  banks  represent  stack  levels, 
so  that  register  save  or  restore  need  be  performed  only  during  stack  overflows 
or  underflows. 

RISC II includes eight register banks, or windows, one of which is reserved
for interrupt processing. At any one time there are ten registers local to the
present procedure level. Additionally, there are six "high" and six "low"
registers which are shared by adjacent procedure levels; these are used primarily
for passing parameters and results between procedures. Each window swap (for
save or restore) involves sixteen registers: the ten locals and one set of
overlaps. Ten global registers are accessible from any procedure level, thus a
total of 32 registers are addressable from any procedure.
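
The register counts in this paragraph can be checked with a few lines of
arithmetic. The constant names are illustrative, and the physical register
total assumes that each set of six overlap registers is shared between two
neighbouring windows arranged circularly, which the text implies but does not
state outright.

```python
GLOBALS = 10   # visible from every procedure level
LOCALS = 10    # private to the current procedure level
OVERLAP = 6    # shared with each adjacent procedure level ("high" or "low")
WINDOWS = 8    # register banks on chip

# Registers addressable from any one procedure.
visible = GLOBALS + LOCALS + 2 * OVERLAP

# Registers moved by one window swap: the locals plus one set of overlaps.
swap_size = LOCALS + OVERLAP

# Physical registers, assuming overlap sets are shared between neighbours.
physical = GLOBALS + WINDOWS * swap_size

print(visible, swap_size, physical)
```

The first two results reproduce the 32 addressable registers and the
sixteen-register swap quoted in the text; the third is a derived figure under
the sharing assumption above.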

Table I and Figure 1 show the relative execution time and performance of
two C programs, "Tower of Hanoi" and "Puzzle", versus the number of windows on
the chip. These results are based on earlier studies of procedure behavior and
register file management overhead for RISC [10,11]. Both benchmarks nest to a
depth of twenty. However, "Tower" has a very high rate of procedure calls and
returns (19%) and thus makes intensive use of the multiple windows.
Performance improves steadily through the use of, say, seven windows. "Puzzle",
on the other hand, performs well with only one or two windows; it has only 0.7%
calls and returns.

TABLE I: Normalized RISC II Execution Time
(relative to case of infinite windows)

Figure 1: Normalized Performance of RISC II

Typical programs have a procedure call or return every twenty instructions,
so the benchmarks shown here represent extremes [12,13]. In consideration of
limited chip resources, a careful analysis of the program environment is
desirable. If few procedures are used, a smaller local memory allows resources
to be utilized for performance improvement in other areas.

Cost  of  Fixed-Size  Window  Swaps 

The cost of register window overflow is determined by two factors: the
overhead of servicing the interrupt caused by the overflow, and the cost of the
actual data transfer between the register file and external memory. The RISC II
microprocessor incurs a penalty of about thirty instructions for the window
overflow/underflow interrupt routine. The single I/O bus implementation
supports one memory access per cycle, which means that each load or store for
the window swap takes two cycles. With sixteen registers per window, a total of
32 cycles are required. Although each window swap is costly, overflow/underflow
occurs infrequently if there is a sufficient number of windows on chip. However,
reducing local memory size below a program-dependent limit degrades
performance significantly due to this high cost of swapping. With fewer windows,
better swap interrupt and I/O support is crucial.

Since each swap utilizes the same protocol (sixteen adjacent registers
swapped to/from the current window), better data I/O support can be provided.
For example, a single instruction may provide all the necessary information for
multiple register copying. Then it is not necessary to fetch individual Load or
Store instructions for each register transfer. Furthermore, only a starting
address needs to be passed to the memory controller in order to initiate the
16-word move. Two data words may then be passed on the bus each machine cycle.
Compared to the present scheme, with one data word every two cycles
interleaved with instruction fetching, throughput is increased by a factor of
four.
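
The swap cost figures above can be combined in a short sketch. It treats the
thirty-instruction interrupt routine as thirty cycles, an approximation that
the report itself makes when describing Table II, and the function name is
illustrative.

```python
def swap_cycles(registers=16, words_per_cycle=0.5, interrupt_overhead=30):
    """Cycles for one window swap: interrupt overhead plus transfer time."""
    return interrupt_overhead + registers / words_per_cycle

present = swap_cycles(words_per_cycle=0.5)  # one word every two cycles
block = swap_cycles(words_per_cycle=2.0)    # two words every cycle

# Speedup of the data transfer alone, excluding interrupt overhead.
transfer_speedup = (present - 30) / (block - 30)
print(present, block, transfer_speedup)
```

The transfer itself speeds up by the factor of four stated in the text, but the
total swap cost falls only from 62 to 38 cycles because the interrupt overhead
is unchanged. This is why interrupt support matters most when few windows are
available.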

At compile time, the dynamic procedure nesting profile is not known.
Therefore  the  compiler  cannot  anticipate  window  overflows  in  a  multiple-window 
implementation.  For  this  reason,  overflows  must  be  detected  on  chip.  It  is  the 
cost  of  handling  this  interrupt  which  accounts  for  thirty  instructions  in  RISC  II. 
For  a  processor  with  a  single  window,  the  compiler  can  anticipate  the  swaps  and 
this  overhead  can  be  reduced.  Every  executed  call  or  return  requires  a  save  or 
restore  operation,  respectively. 

Table II presents RISC II execution time as a function of data I/O bandwidth
and local memory size. The cost of each swap includes the thirty cycles for
interrupt overhead, as well as the sixteen data word transfers. Since swap
overhead for "Puzzle" is small, only "Tower" is considered here.


TABLE II: Execution Time for "Tower" with Varying Swap Bandwidth
(includes interrupt overhead for multiple window cases)


Performance penalty due to interrupt overhead is illustrated by the shaded
area in Figure 2. As before, seven or more windows are desirable for high
performance, regardless of the level of data I/O support. This is because the
interrupt overhead is so high. An exception is the single window case with dual
data I/O per cycle, which is seen to provide better performance than register
files with two or three windows. With a large local memory, efficient interrupt
support is not so important since few swaps occur. With few windows, though, it
can be crucial. The remainder of this chapter will not consider the interrupt
overhead; it is assumed that the area freed by reducing local memory size may be
dedicated toward better interrupt support, and that overall swapping cost is
dominated by the register traffic.

Figure 2: Performance with Overhead of RISC II Swap Interrupt
("Tower" Benchmark, half data I/O per cycle)

Improved  Swapping  Strategies 

Thus far, we have only considered fixed-size window swaps for which all
registers are transferred to/from memory. This scheme is attractive due to its
simplicity and ease in providing higher swap bandwidth. However, such a
scheme swaps all registers in the window, whether they were used or not. A
study of several C programs has determined that on the average only four
registers are used per procedure in RISC [10]. Therefore, the fixed-swap scheme
performs four times the number of necessary save and restore data transfers.

Register Usage Record


In order to keep track of the registers actually used, a "dirty bit" may be
employed. During each register write, a bit is set to indicate a register that
needs to be saved when the window gets swapped. Swaps thus vary in length,
depending on the number of bits set. The increased hardware complexity
necessary to support such an approach, however, is undesirable.

An alternative is to utilize a single-word register usage mask for each
window. Each bit which is set in the mask indicates usage of a specific register.
A stack of such masks must be maintained on-chip for resident windows. During a
window overflow, the appropriate mask is stored with the window contents.
Additional logic is required in order to encode and decode the mask and provide
for a mask stack.
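
A software model may make the usage-mask idea concrete. The sketch below is
only an illustration: the class and method names are hypothetical, the window
is modeled as a Python list, and the on-chip mask stack is not modeled.

```python
class Window:
    """A register window with a one-word usage mask (bit i = register i)."""

    def __init__(self, size=16):
        self.regs = [0] * size
        self.mask = 0

    def write(self, index, value):
        self.regs[index] = value
        self.mask |= 1 << index        # record that this register was used

    def save(self):
        """Return the mask and only the registers that the mask marks."""
        used = [(i, self.regs[i])
                for i in range(len(self.regs))
                if (self.mask >> i) & 1]
        return self.mask, used

    @classmethod
    def restore(cls, mask, used, size=16):
        window = cls(size)
        for index, value in used:
            window.regs[index] = value
        window.mask = mask
        return window

w = Window()
w.write(0, 7)
w.write(3, 9)
w.write(3, 11)     # a second write to the same register sets no new bit
w.write(15, 1)
mask, used = w.save()
print(bin(mask), used, len(used) + 1)   # transfers: used registers + mask word
```

Here a swap moves three registers plus the mask word instead of all sixteen
registers, at the price of the encode and decode logic mentioned above.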

A single window implementation does not require this hardware. The
compiler can insert code before each call in order to save the registers used in
the current window. Restoring registers after a return can be done on demand by
the compiler. Since not all registers may need to be restored, further reduction
in I/O is anticipated.

Save overhead may also be eliminated by performing a data memory write
in parallel with all register file writes. This "store-through" scheme requires a
dual-bus microarchitecture which can fetch an instruction and perform a data
access in each cycle, such as the TMS 320 [14] or MIPS [15] microprocessors.
Overall, swap overhead may then be reduced by more than a factor of eight.

Variable  Window  Size 

Although we have described alternative strategies for local memory
management, we have not yet addressed effective use of the on-chip memory
area. The above schemes reduce data traffic and off-chip register save space by
a factor of four, but on-chip memory still attains only 25% utilization. If the
register file windows can vary in size, such a waste of resources can be avoided.

For a variable-size window scheme, "bank" and "window" no longer need to
be synonymous. The register file may be divided into fixed-size banks for
regular and efficient swapping. Several procedures may reside within a single
bank of, say, 18 or 32 registers; they may also span bank boundaries. This scheme
requires additional hardware, in the form of pointers for each procedure
domain, and an adder to calculate the physical addresses of the registers.
Further details of variable-size window schemes may be found in [1].
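
The pointer-plus-adder addressing described above might look as follows in a
software model. This is a sketch under stated assumptions: the class and
method names are hypothetical, the register file is treated as circular
(modulo addressing), and overflow detection and bank swapping are left out.

```python
class VariableWindowFile:
    """Register file in which each procedure gets a frame of its own size."""

    def __init__(self, file_size=32):
        self.file_size = file_size
        self.frames = []       # stack of (base pointer, frame size)
        self.next_free = 0

    def call(self, size):
        """Allocate a frame of `size` registers for a newly called procedure."""
        self.frames.append((self.next_free, size))
        self.next_free = (self.next_free + size) % self.file_size

    def ret(self):
        """Release the frame of the returning procedure."""
        base, _ = self.frames.pop()
        self.next_free = base

    def address(self, logical):
        """Physical address = base pointer + logical register number."""
        base, size = self.frames[-1]
        if not 0 <= logical < size:
            raise IndexError("register outside the current frame")
        return (base + logical) % self.file_size

f = VariableWindowFile(32)
f.call(4)                  # first procedure uses four registers
a = f.address(3)
f.call(6)                  # nested procedure uses six
b = f.address(0)
f.call(4)
c = f.address(2)
f.ret()
d = f.address(5)
print(a, b, c, d)
```

With about four registers per procedure, a 32-register file of this kind holds
roughly eight procedure frames, where two fixed sixteen-register windows would
hold only two.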

TABLE III: Execution Time for "Tower" with Various Swap Schemes
(one data I/O per cycle assumed for all cases)

Performance comparison of these schemes is presented in Table III. All
cases assume single data I/O per cycle and four registers per procedure.
Interrupt overhead, which occurs for the multiple window and variable size
schemes, is not included here. The variable window scheme is assumed to utilize
two equal-size banks, with total register count being sixteen times the number
of windows indicated in the table. One of these banks is swapped during an
overflow or underflow. Significant performance improvement is
observed for implementations with few windows using these alternative swap
schemes. By using more efficient swapping strategies, high performance may be
attained with less chip area dedicated to local memory.

Register  File  Delay 

Up to this point, only I/O limited performance has been discussed. From
the designer's point of view, attention should also be focused on datapath
bandwidth. Especially for RISCs, where each execution cycle consists of a
uniform register-to-register operation, the datapath cycle time determines
maximum system performance. The machine cycle of the RISC II consists of a
dual-port register read, followed by an ALU or shift operation; these latter
operations overlap the register write of the previous instruction and bitline
precharge of the next instruction. The machine cycle is then limited directly by
the register file read-write-precharge cycle time.

The register file cycle time increases with local memory size and depends
on several design parameters. Read delay consists of two components: wordline
assertion, and bitline discharge (Figure 3). The wordline, or addressing, delay is
proportional to the gate capacitance loading of the access transistor. This delay
then is linearly proportional to word size. In technologies where the wordline
itself is the dominant resistance in addressing delay, such as with a polysilicon
wordline, this delay follows the square of word length. Periodic buffering of the
address (as for the dynamic ripple carry ALU) is necessary to reduce this delay
to a linear function of word size. Bitline discharge delay increases with the
product of the resistance of the wordline access transistor and the bitline
loading capacitance. This delay then increases proportionally with memory word
capacity.

Figure 3: Register File Read Delay

Figure 4: Cycle Delay versus Local Memory Capacity


Total register delay is therefore expected to increase proportionally with
local memory size. However, some optimization is allowed by the wordline
access transistor. Reducing its width allows faster addressing at the cost of
increased bitline discharge delay. Optimal performance is attained when both
delays are equal [16]; this implies an access transistor size (and hence
register cycle delay) which is proportional to the square root of the memory
capacity. The write and precharge delays are then affected similarly. As a
result, datapath cycle time increases with the square root of word length or
word capacity of the register file (Figure 4). This effect must be taken into
account to obtain a more realistic performance estimate for various local
memory sizes.
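
The square-root dependence can be illustrated with a toy model. The sketch
below assumes, purely for illustration, that wordline delay grows as
a * width * bits and bitline discharge delay as b * words / width, where
width is the access transistor width; the constants a and b and the function
names are hypothetical and are not taken from [16].

```python
import math

def read_delay(width, words, bits, a=1.0, b=1.0):
    """Toy read delay: the wordline term grows with width, the bitline term shrinks."""
    wordline = a * width * bits      # gate loading seen by the wordline driver
    bitline = b * words / width      # access resistance times bitline capacitance
    return wordline + bitline

def optimal_width(words, bits, a=1.0, b=1.0):
    """Width that makes both terms equal, which also minimizes their sum."""
    return math.sqrt(b * words / (a * bits))

def optimal_delay(words, bits, a=1.0, b=1.0):
    """Read delay at the optimal width: 2 * sqrt(a * b * words * bits)."""
    return read_delay(optimal_width(words, bits, a, b), words, bits, a, b)
```

Under these assumptions, quadrupling the number of words doubles the optimized
delay, which is the square-root behavior described above.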

Register Cycle Time                    1.00  1.73

Register Swaps with Varying
Data I/O per cycle                     3.68  4.04  2.48
                                2      2.52  1.84  1.95  1.90  2.05  2.24

Partial Register Swaps                 1.76  1.53  1.68  1.82  2.02  2.25

Partial Swap with Store-Through        1.38  1.38  1.55  1.78  2.01  2.24

Variable Size with Half Bank Swaps     2.01  1.48  1.52  -     -     -

TABLE IV: Datapath Bandwidth Limited Execution Time
(smaller local memory is faster; using "Tower" benchmark)

Table IV includes the effect of variable register cycle delay. The partial
register swap schemes using a single window yield the best performance,
rivaled only by the variable size scheme with its additional hardware
complexity. A single window, fixed-swap implementation with dual data I/O per
cycle approaches the performance of the seven window RISC II with half data
I/O per cycle. Execution time for "Puzzle", with little swap overhead, follows
the register cycle time dependence with memory size; it executes nearly twice
as fast with one window as it does with seven. Inclusion of register file
delay then has a major impact on the relative merits of the swapping schemes.
Of course, the smaller local memory implementations have higher datapath
bandwidth, so a similarly faster external memory is required in order to
realize this performance. Local memory size is traded off against memory
bandwidth requirements.

Chip design tradeoffs in VLSI must be made using both architectural and
circuit design considerations. One measure of the cost of local memory,
increased delay, yields two minimum execution time solutions: I/O limited, and
datapath bandwidth limited. Because of the limited number of pads that can be
placed on a chip, memory I/O is a severe bottleneck in system performance.
For this reason, a large local memory was chosen for RISC II. Presently,
memory speed is increasing, making datapath bandwidth a more critical limit to
system performance. In the future, increased chip resources will make possible
a greater local memory hierarchy [17]; I/O bandwidth may then be replaced by
datapath bandwidth as the primary factor limiting system performance.

CHAPTER 4:


DATAPATH  TIMING 


A  critical  issue  for  register-to-register  machines  is  the  register  file  organi¬ 
zation  and  timing.  A  variety  of  bitline  and  wordline  configurations  yields  a  wide 
range  of  datapath  bandwidth.  The  microarchitect  must  determine  what  silicon 
resources  are  required  for  each  approach  and  study  the  tradeoffs  between  the 
various  timing  schemes.  The  purpose  here  is  to  review  a  variety  of  possible 
datapath  timing  schemes  in  order  to  develop  an  intuitive  understanding  of  the 
tradeoffs  involved. 

Various datapath designs are compared here in terms of the speed at which
register-to-register (R-R) operations can be executed. This determines the
performance limit of simple, register-oriented machines, such as the RISC I
[1], MIPS [2], and the 801 [3]. Regular instruction cycle timing yields simple,
regular implementation of the control circuitry. Regularity of instruction
execution permits pipelining without requiring complicated hardware pipeline
interlocks.

The  benefits  of  instruction  pipelining  may  also  be  exploited  by  a  micro- 
coded  machine.  Pipelining  favors  the  use  of  regular,  register-to-register 
microoperations,  rather  than  the  use  of  irregular,  but  fast,  microcoded  imple¬ 
mentation  of  the  relatively  few  dynamically  executed,  compiled  complex 
instructions  [4]. 

When viewed from the datapath, the actual instruction coding is not visible.
Therefore  no  assumptions  regarding  the  machine's  instruction  set  are  made  in 
this  chapter.  Instruction  coding  is  a  higher  level  issue,  for  which  tradeoffs  can 
be  assessed  once  a  programming  environment  has  been  chosen. 

For  simplicity,  overhead  of  I/O  will  be  ignored.  Performance  limits  due  to 
off-chip  communication  were  considered  in  the  previous  chapter.  Program 
counter  logic  will  not  be  considered  explicitly;  it  may  be  encompassed  for  our 
purposes within the domain of the addressable register file. Therefore, ideal
system  speed,  to  be  discussed  in  this  chapter,  is  determined  directly  by  the 
datapath  bandwidth. 

The fundamental datapath operations to be performed during each
register-to-register cycle are shown in Figure 1. They include reading two
operands from the register file, performing an ALU or shifter operation, and
writing the result back into the register file. In NMOS implementations, the
read bitlines must be restored to a logic "1" prior to reading. This is
necessary if the read bitline is dynamically precharged, and/or the bitline is
used for both reading and writing through the same register cell port. This is
to ensure the read value is valid, and that the read operation does not
accidentally write into the cell.

A  critical,  limiting  factor  in  determining  allowable  concurrency  is  the  regis¬ 
ter  file  organization:  three  of  the  four  basic  operations  concern  it.  Various  bit¬ 
line  (bus)  and  wordline  (addressing)  organizations  will  be  considered  in  discuss¬ 
ing  timing  schemes  which  exploit  greater  levels  of  concurrency  than  the  sequen¬ 
tial  example  above. 

Shared  Bitline  Register  Files 


Figure 2: Shared and Dedicated Bitline 2-Port Cells

Of  fundamental  importance  in  determining  allowable  concurrency  within  a 
register  file  read-write  cycle  is  the  bitline  arrangement.  A  shared  bitline  organi¬ 
zation  utilizes  the  same  bitlines  for  both  reading  and  writing.  A  dedicated  bitline 
approach  utilizes  separate  bitlines  for  reading  and  writing,  allowing  some  over¬ 
lap  of  these  operations  (Figure  2).  Advantages  of  the  shared  bitline  design 
include  its  economy  of  area,  due  to  fewer  bitlines  and  fewer  transistors.  This 
cell  may  also  be  faster,  since  its  smaller  size  reduces  loading  on  the  wordlines. 
This helps make up for the reduced level of concurrency attainable with fewer
bitlines. Overall system timing for the shared bitline approach is identical to
that in Figure 1.

(c) with dual-port write
Figure  3:  Shared  Bitline  Register  Cells 

Another  design  choice  exists  for  the  shared  bitline  cell,  leading  to  two 
different  structures.  One  may  use  shared  (Figure  3(a))  or  dedicated  (Figure 
3(b))  wordlines  for  the  read  and  write  operations.  In  the  first  case  there  is  one 
set  of  wordlines,  to  be  used  for  both  reading  and  writing  to  common  ports;  writ¬ 
ing  is  usually  performed  via  complementary  signals  driving  two  ports.  This  is  a 
derivation  of  typical  commercial  Random  Access  Memory  (RAM)  designs.  The 
shared  wordline  approach  leads  to  a  fast  and  compact  cell.  Such  a  static  design 
has  been  implemented  in  the  RISC  II,  which  requires  a  large,  dual-port  register 
file  [5];  its  symmetry  was  crucial  in  attaining  a  compact  layout. 

Care must be taken to ensure that reading onto a precharged bus will not
change the state of the shared wordline cell. As shown in Figure 3(a), each bit
cell  inverter  may  be  considered  to  be  transformable  into  a  2-input  NOR  gate  by 
the  addition  of  the  wordline  transistor.  If  the  bus  is  initially  discharged  to 
ground,  wordline  access  will  behave  as  a  NOR  input  and  set  the  cell  state  (write). 
If  the  bus  is  precharged  prior  to  accessing,  reading  will  cause  either  no  change 
(if  both  sides  of  the  wordline  FET  are  high),  or  bus  discharge  will  occur  (cell 
node  grounded).  The  wordline  transistor  in  this  latter  case  acts  as  a  voltage 
divider.  A  narrow  transistor  will  have  a  greater  voltage  drop  across  it  during 
discharging,  allowing  the  cell  state  to  be  maintained.  A  wide  transistor  will 
reduce  the  voltage  drop,  allowing  the  high  logic  level  of  the  bus  to  change  the 
cell  state.  This  read  disturb  problem  then  constrains  the  size  of  the  wordline 
transistors,  limiting  speed  of  bitline  discharge  during  reading. 
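
The read disturb constraint can be sketched with a simple resistive divider.
The model below is an assumption made for illustration only: both devices are
treated as linear resistors whose resistance is inversely proportional to
width (equal channel lengths), and `v_switch`, the cell node voltage at which
the cell would flip, is a hypothetical parameter.

```python
def cell_node_voltage(w_access, w_pulldown, vdd=5.0):
    """Voltage on a grounded cell node while a precharged bus is read through it.

    The access transistor (bus side) and the cell pulldown (ground side) form
    a divider; resistance is taken as proportional to 1/width.
    """
    return vdd * w_access / (w_access + w_pulldown)

def max_access_width(w_pulldown, v_switch, vdd=5.0):
    """Largest access width that keeps the cell node below v_switch."""
    return w_pulldown * v_switch / (vdd - v_switch)
```

With these assumptions a wider access transistor raises the cell node voltage,
so the width is bounded from above, which in turn bounds the bitline discharge
speed.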

Dedicated  wordlines  form  separate  ports  for  reading  and  writing.  A  pseu¬ 
dostatic  design,  as  shown  in  Figure  3(b),  allows  temporary  breaking  of  the  feed¬ 
back  loop  (by  the  refresh  FET).  This  allows  single-port  writing  (with  no  wordline 
FET  voltage  drop)  and  eliminates  read  disturb  by  breaking  the  feedback  loop. 
The  wordline  transistor  size  is  not  constrained  as  before.  In  the  case  of  a  dual- 
port  pseudostatic  cell  with  two  pairs  of  wordlines,  two  locations  in  the  register 
file  may  be  written  simultaneously  (Figure  3(c)).  This  scheme  was  chosen  for 
the  Caltech  OM-2  [6],  as  well  as  the  HP  FOCUS  chip  [7].  In  order  to  maintain 
data  integrity,  the  refresh  transistor  must  be  clocked  periodically.  This  is  usu¬ 
ally  done  every  cycle,  during  the  bitline  restore  phase. 

A  disadvantage  of  this  design  is  its  asymmetry,  due  to  the  refresh  transistor 
and  the  single  inverter  drive  for  reading.  The  first  inverter  is  not  used  for  read¬ 
ing  because  its  gate  can  only  be  charged  to  VDD-VT  by  the  enhancement  pass 
transistors.  Overall  cycle  time  then  must  include  the  worst  case  delay  of 
discharging both bitlines.

Figure 4: Timing of Delayed Write Scheme

In order to reduce the datapath cycle from four phases to three, the RISC II
increased the level of pipelining and incorporated a delayed write scheme [8].
In effect, writing is delayed to overlap the ALU computation of the following
instruction. This added level of pipelining is helpful as it allows greater
time for interrupts to be detected without destroying the contents of the
register file. Also, if the ALU delay is significant, it may overlap both the
register write and bitline restore phases (Figure 4).


Figure  5:  Delayed  Write  with  Internal  Forwarding 

Some performance degradation might result from this scheme due to data
dependencies. The result of a computation is not available in the register
file for the read phase of the following instruction. This is a consequence of
this pipelined implementation with the delayed write. The problem was solved
in the case of RISC II by detecting data dependencies and forwarding the data
through a temporary register to the ALU or shifter. This internal forwarding,
or chaining, technique allows the data in this register to override the result
of the register file read (Figure 5). This technique is transparent to the
programmer or compiler writer. Such an approach is routinely used to increase
performance of highly-pipelined computers, such as the CRAY I.
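
The behavior of delayed write with internal forwarding can be sketched in
software. The following is a behavioral model only, not the RISC II circuit;
the class and method names are hypothetical, and a single temporary register
is assumed.

```python
class DelayedWriteRegisterFile:
    """Register file whose writes land one instruction late, with a bypass."""

    def __init__(self, size):
        self.regs = [0] * size
        self.pending = None  # (dest, value) held in the temporary register

    def read(self, index):
        # Dependency detection: a pending result overrides the stale cell.
        if self.pending is not None and self.pending[0] == index:
            return self.pending[1]
        return self.regs[index]

    def execute(self, dest, src_a, src_b, op):
        # Read phase, with forwarding from the temporary register.
        a, b = self.read(src_a), self.read(src_b)
        # Delayed write of the previous instruction's result.
        if self.pending is not None:
            prev_dest, prev_value = self.pending
            self.regs[prev_dest] = prev_value
        # The new ALU result waits in the temporary register.
        self.pending = (dest, op(a, b))
        return self.pending[1]
```

An instruction that reads the destination of its predecessor receives the
forwarded value even though the register cell has not yet been written, so the
delay is invisible to the program.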

Dedicated  Bitline  Register  Files 


(a)  Static  Dedicated  Bitline  Cell  (b)  Pseudostatic  Dedicated  Bitline  Cell 


Figure  6:  Dedicated  Bitline  Cells 

As previously mentioned, the dedicated bitline design utilizes separate
bitlines for reading and writing. Implicitly, this requires separate wordlines
as well to guarantee independence of read and write operations (see Figure 6).
This structure supports a higher level of concurrency and therefore may be
desirable for a high-speed datapath. Restoring of the bitline may overlap the
writing of the cell since it uses separate control and data lines. This makes
possible the three-phase timing of Figure 7. Such an approach has been used in
the RISC I [8] and the Matsushita MN1613 [9]. This scheme, however, will be
slower than the previous approach (shared bitline with delayed writing) if the
ALU delay is greater than that of the bitline restore. This, in conjunction
with the cell area difference, makes the three-phase dedicated bitline scheme
discussed here undesirable for large register files.

Figure 7: Timing of Dedicated Bitline Datapath


Figure 8: Timing of Dedicated Bitline with Delayed Write

In order to increase the concurrency, the delayed-write scheme may be
used. Timing of the ALU operation, register write, and the bitline restore all
overlap (Figure 8). Internal forwarding logic is necessary to eliminate data
dependency problems as before. Forwarding is performed in parallel with the
register write.


Figure 9: Overlapped Read/Write Scheme Timing

Alternatively,  the  read  and  write  operations  may  be  overlapped,  as  shown  in 
Figure  9.  Dependency  detection  logic  is  again  required,  and  internal  forwarding 
is  performed  in  parallel  with  the  register  write. 


Figure  10:  Delayed  Write  with  Overlapped  Read  Timing 

For  even  higher  performance,  two  sets  of  data  dependency  logic  are 
required.  The  first  forwards  the  result  of  an  ALU  or  shift  operation  to  the  data 
read  register  of  the  following  instruction.  The  second  forwards  this  data,  as  it  is 
being  written  into  the  register  file,  to  the  read  register  for  the  instruction  after 
that.  This  allows  the  greatest  concurrency,  as  shown  in  Figure  10. 

In order to combine the register read and bitline restore in a single phase,
the restore must be initiated early enough during the read phase so that it
overlaps the addressing delay. At the time the read wordlines are driven to a
logic "1", the bitlines must be precharged above the bit cell logic threshold
in order to eliminate writing into the cell. Precharge may continue,
overlapping wordline delay so that adequate noise margins are maintained.
Alternatively, current sensing may be used, in which case the bitline voltage
remains constant. This technique has been utilized in MOS ROMs [10].

System  throughput  for  this  single-phase  timing  scheme  may  be  quadruple 
that  of  the  original  four-phase  sequential  example  at  the  beginning  of  this 
chapter.  This  performance  increase  is  achieved  by  maximizing  module  usage  in 
each  phase,  in  tune  with  effective  chip  resource  utilization. 
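
The timing schemes of this chapter can be compared with a simple model in
which operations that overlap cost only the longest of their delays. This is a
simplified reading of the timing figures, not a circuit-level result; the
function name and the scheme labels are chosen here for illustration.

```python
def cycle_times(read, alu, write, restore):
    """Cycle time per scheme; overlapped operations cost max() of their delays."""
    return {
        # Figure 1: the four operations follow one another.
        "sequential four-phase": read + alu + write + restore,
        # Figure 4: the ALU overlaps the previous write and the bitline restore.
        "shared bitline, delayed write": read + max(alu, write + restore),
        # Figure 7: the restore overlaps the write on separate bitlines.
        "dedicated bitline, three-phase": read + alu + max(write, restore),
        # Figure 8: ALU, write, and restore all overlap.
        "dedicated bitline, delayed write": read + max(alu, write, restore),
        # Figure 10: every operation overlaps in a single phase.
        "single-phase, fully overlapped": max(read, alu, write, restore),
    }
```

With equal phase delays the model gives cycle times of 4, 3, 3, 2, and 1
phases, consistent with the factor of four quoted above. With an ALU delay
twice the restore delay, the three-phase dedicated bitline scheme is slower
than the shared bitline scheme with delayed write.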

The  treatment  of  datapath  timing  and  register  file  organization  has  been 
very  simplistic  in  this  chapter.  Since  the  bit  cells  must  be  designed  uniquely  for 
each  timing  scheme,  their  area  will  vary.  This  has  a  varying  impact  on  chip 
resources  and  on  cycle  delay  time.  A  more  detailed  analysis  is  thus  required  for 
the  selection  of  an  optimal  register  cell  and  timing  scheme.  This  will  be  dis¬ 
cussed  in  more  detail  in  Chapter  8. 

CHAPTER 5:


ALU  DESIGN  TRADEOFFS 

Traditionally,  evaluation  of  different  adder  schemes  has  been  carried  out 
with  the  assumption  of  a  fixed  gate  delay.  Such  a  straightforward  comparison  is 
permitted  by  low  levels  of  integration,  using  SSI  parts.  These  parts  are  designed 
to  accommodate  a  wide  range  of  capacitance  loading  due  to  off-chip  wiring.  As  a 
result,  delay  exhibits  little  dependence  on  the  loading  capacitance  typically 
encountered [1]. The designer calculates circuit delay by simply determining
the  number  of  gates  in  the  critical  path. 

The  custom  nature  of  VLSI,  on  the  other  hand,  gives  the  designer  more  free¬ 
dom  to  optimize  performance.  Dynamic  logic  and  bootstrapping  techniques  can 
be  used  to  increase  performance.  Under  this  variety  of  approaches,  gate  delays 
can no longer be considered constant. Comparison of adder performance based
on  the  fixed  delay  model  is  inadequate  for  VLSI  implementation. 

This  chapter  will  begin  with  a  review  of  adder  design  strategies.  Initially  a 
gate-level  view  will  be  used  in  order  to  simplify  understanding.  It  is  also  directly 
applicable  to  fixed  gate  delay  analysis.  This  will  be  followed  by  a  discussion  of 
design in NMOS using dynamic logic and bootstrapping. Finally, different carry
schemes  will  be  evaluated  for  both  the  fixed  delay  and  NMOS  implementations. 

Adder  Design  at  the  Gate  Level 

An example of a single-bit cell of a full adder is shown in Figure 1. Three
delays exist: input translation, carry calculation, and sum generation. The
translation and sum delays are constant; they each consist of a single gate
delay. The carry delay, however, is cumulative since its calculation is
dependent on the result from the previous bit cell. The carry output of the
most significant bit is thus dependent on all the previous stages. Overall
carry delay will vary with the method used for its calculation, as well as
with the number of bits N. For this reason we will focus on the circuitry that
calculates the carry.

Figure 1: Full Adder Cell

Ripple  Carry 

The  simplest  adder  scheme  utilizes  a  ripple  carry  as  shown  in  Figures  1  and 
2.  For  an  N-bit  adder,  the  carry  propagates,  or  ripples,  across  N  stages.  Each 
stage  consists  of  2  gate  delays,  so  the  total  carry  delay  for  an  N-bit  adder  is  2N. 
Advantages  of  this  design  include  minimal  gate  count,  as  well  as  regularity  and 
short  wire  length  for  implementation  in  VLSI. 
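
The ripple scheme can be sketched at the bit level as follows. This is a
behavioral model under the fixed gate delay assumption of this section; the
helper names and the little-endian bit-list representation are choices made
here for illustration.

```python
def to_bits(value, n_bits):
    """Little-endian list of the low n_bits of value."""
    return [(value >> i) & 1 for i in range(n_bits)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_carry_add(a_bits, b_bits, carry_in=0):
    """Add two equal-length bit lists; return (sum_bits, carry_out, gate_delays)."""
    assert len(a_bits) == len(b_bits)
    carry = carry_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        propagate = a ^ b
        generate = a & b
        sum_bits.append(propagate ^ carry)
        # Two gate levels per stage: AND with the incoming carry, then OR.
        carry = generate | (propagate & carry)
    return sum_bits, carry, 2 * len(a_bits)
```

For example, adding 200 and 100 as 8-bit operands yields sum bits for 44, a
carry out of 1, and a carry delay of 16 gate delays in this model.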

The ripple carry approach is used for small word sizes or in applications
where speed is not critical. An 8-bit ripple adder was chosen for the Intel
8080 8-bit microprocessor because the regular, compact layout had fewer
parasitics and was actually faster than a lookahead approach [2]. The Motorola
68000 used a 16-bit ripple adder because it was found to be faster than a
lookahead adder with the same amount of power dissipation [3]. Ripple adders
were also used for the 16-bit Caltech OM-2 [4] and the 32-bit RISC II [5]
implementations, mainly because of their small chip area and short layout
time.

Figure 2: Ripple Adder Scheme (2 bits shown)

Methods  of  reducing  ripple  carry  delay  are  presented  in  [2].  Using  an 
increased fan-in of 4, the delay can be reduced to an average of 4 gate levels for
each  3  bit  group  by  propagating  multiple  intermediate  carry  terms  between 
each  stage.  However,  many  more  gates  and  wires  are  required,  and  the  overall 
structure  is  much  less  regular  than  that  of  Figure  2.  For  these  reasons,  such  an 
approach  will  not  be  investigated  further. 

A carry-select (or conditional carry) adder is shown in Figure 3. The carry
output of the first M-bit ripple adder is used to select the proper output of
the next pair of ripple adders, each with complementary carry inputs. All
ripple adders operate at the same time, so overall delay consists of an M-bit
ripple add followed by a cascade of multiplexors. Carry delay goes as
$(M + N/M)$, so there exists an optimal M yielding a lower bound in time. This
choice of M for highest performance is $\sqrt{N}$, assuming bit delay is equal
to multiplex delay.
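
The optimal group size can be checked numerically. The sketch below takes the
carry delay to grow as M + N/M, with ripple bit delay and multiplex delay as
separate parameters; the function names are hypothetical, and only group sizes
that divide N evenly are considered.

```python
def carry_select_delay(n_bits, group, t_bit=1.0, t_mux=1.0):
    """Approximate carry delay: one group ripple plus one multiplexor per group."""
    return group * t_bit + (n_bits / group) * t_mux

def best_group_size(n_bits, t_bit=1.0, t_mux=1.0):
    """Group size (a divisor of n_bits) with the smallest approximate delay."""
    candidates = [m for m in range(1, n_bits + 1) if n_bits % m == 0]
    return min(candidates,
               key=lambda m: carry_select_delay(n_bits, m, t_bit, t_mux))
```

For N = 16 and N = 64 with equal bit and multiplex delays, the best group
sizes are 4 and 8, the square roots of N. For N = 32 the square root is not an
integer, and groups of 4 and 8 bits tie in this model.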

The  carry  select  adder  is  fully  modular.  Layout  may  be  done  with  a  few 
basic  cells.  No  irregular  wiring  is  required  among  the  modules.  This  is  impor¬ 
tant  in  reducing  the  design  time,  layout  area,  and  the  probability  of  design 
errors in the random wiring. Gate count (and therefore power and area) for
carry calculation is nearly twice that of the ripple carry approach.

Figure 4: Parallel Adder (8 bits shown)

A full-lookahead (or parallel) adder performs calculation of all P and G
product terms. Figure 4 details the organization of an 8-bit parallel adder as
well as the design of the individual circuit modules. The overall delay for an
N-bit parallel adder goes as $\log_f N$ assuming a gate fan-in of $f$. This
$\log_f$ behavior is important in reducing carry delay for large adders.
Parallel adders have been implemented in the HP Focus [6], MIPS [7] and Xerox
Dragon [8] 32-bit microprocessors.
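
The logarithmic behavior can be illustrated with a generic parallel-prefix
formulation of the P and G terms using a fan-in of 2. This is a textbook-style
sketch chosen for illustration, not the circuit of Figure 4; the function
names are hypothetical.

```python
def to_bits(value, n_bits):
    """Little-endian list of the low n_bits of value."""
    return [(value >> i) & 1 for i in range(n_bits)]

def lookahead_carries(a_bits, b_bits, carry_in=0):
    """Return (carries, levels): carry into each bit plus the carry out,
    and the number of P/G combining levels used."""
    n = len(a_bits)
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate terms
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate terms
    g[0] |= p[0] & carry_in                       # fold in the carry input
    levels, span = 0, 1
    while span < n:
        new_g, new_p = g[:], p[:]
        for i in range(span, n):
            # Combine each group with the group `span` bits below it.
            new_g[i] = g[i] | (p[i] & g[i - span])
            new_p[i] = p[i] & p[i - span]
        g, p = new_g, new_p
        span *= 2
        levels += 1
    return [carry_in] + g, levels
```

For 8 bits the loop runs three times, matching log2(8) = 3 combining levels; a
higher fan-in would reduce the number of levels further at the cost of larger
gates.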

Such an approach requires nearly four times the gate count of the ripple
scheme for carry calculation. The associated increase in power consumption
and irregular wiring makes this design much more costly for VLSI
implementation than the previous ones. A frequently used compromise is to do a
partial carry lookahead, trading off gate and power requirements against carry
delay. In this approach, lookahead is performed in M-bit groups, the results of
which are input to M-bit ripple adders as in the carry-select scheme. Results
of this partial lookahead are partial products which are input to ripple carry
adders. The Bellmac-32 [9] and the RISC I [10] 32-bit implementations, as well
as a prototype design by Siemens [11], utilize partial carry lookahead adders
with M=4. Another example is the partial carry lookahead adder using MSI parts
shown in TI's TTL Data Book [1]. A regular layout for lookahead computation is
discussed in [12].

Adder  Designs  in  NMOS 

So far our model assumed fixed delay and power per gate. In NMOS the high
transistor impedance and the variety of static and dynamic circuit
implementations reduce the validity of such an analysis. Dynamic logic
requires no static power consumption. Operation is performed in two phases:
first the output nodes are dynamically precharged, then they are selectively
discharged. This selective discharge of precharged nodes without static
pullups requires no ratioing of the transistors as for static logic. This
allows transistors driving critical paths to be freely increased in size, to
attain desired speed.

The equivalent gate model for a precharged ripple carry chain is shown in
Figure 5(a). This is similar to the ripple logic in Figure 2. The NMOS dynamic
ripple circuitry is given in Figure 5(b). During φ1 the output logic levels are
precharged. At this time the carry input is not allowed to discharge the chain.
On φ2 the carry input is enabled to propagate along the ripple stages by
selective discharge. The G terms are also enabled to discharge the carry line.


During φ1 the carry line is precharged; operation then proceeds with φ2 as for
the previous example. Use of this bootstrapping technique allows for higher
performance of a long ripple chain.

Figure 6: Transmission Line Equivalent of Carry Chain

A long dynamic carry chain will have many pass transistors in series. As a
result, carry propagation across N bits will be quite slow. Behavior of such a
carry chain is equivalent to that of a transmission line (Figure 6). Assuming
that the pass transistors are of minimum channel length and large enough to
dominate carry line parasitics, the transmission line delay becomes
independent of transistor width. This is because any change in channel
conductance (proportional to width) is accompanied by a proportional change in
gate capacitance. The overall RC product remains constant for a particular
technology.

Figure 7: Carry Chain Buffering (precharge devices not shown)


One  way  of  overcoming  this  long  carry  chain  delay  is  to  periodically  buffer 
the  carry  line  (Figure  7).  This  is  equivalent  to  the  use  of  repeaters  on  lossy 
transmission  lines.  Overall  delay  for  long  chains  then  becomes  a  linear  function 
of  carry  chain  length,  rather  than  a  square  function  as  would  be  the  case 
without  the  buffers. 
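
The change from square to linear dependence can be seen in a simple delay
model. The sketch below assumes an Elmore-style estimate for a uniform chain
of series pass transistors and a fixed delay per buffer; the unit RC product
and the buffer delay of 10 units are hypothetical values, not the SPICE data
of this chapter.

```python
def chain_delay(n_bits, rc=1.0):
    """Elmore-style delay of n_bits series pass transistors: grows as n^2."""
    return rc * n_bits * (n_bits + 1) / 2

def buffered_delay(n_bits, section, rc=1.0, t_buffer=10.0):
    """Delay when a buffer follows every `section` bits of the chain."""
    full, rest = divmod(n_bits, section)
    delay = full * (chain_delay(section, rc) + t_buffer)
    if rest:
        # A final partial section ripples without a trailing buffer.
        delay += chain_delay(rest, rc)
    return delay
```

In this model, doubling an unbuffered chain from 32 to 64 bits increases the
delay by nearly a factor of four, while the same chain buffered every four
bits only doubles in delay.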


Figure  8:  Buffered  Carry  Chain  Optimization 

Results  of  an  analysis  to  determine  optimal  length  of  chain  sections 
between  buffers  are  shown  in  Figure  8.  Data  are  based  on  SPICE  simulation 
results  using  the  device  parameters  in  Table  I.  The  ratio  of  parasitic  to  gate 
capacitance  was  1:4.  For  the  standard  ripple  carry  design,  four  bits  are  optimal. 
This  value  was  implemented  in  the  Caltech  OM-2  and  the  RISC  I.  The 
bootstrapped  approach  yields  higher  performance  through  eight  bits,  at  which 
point  only  half  the  number  of  buffers  are  required. 

Capacitances:                              Transistors:
  metal                0.14  fF/λ²                       0.9   V
  diffusion bulk       0.3   fF/λ²           VTD        -3.2   V
  diffusion side-wall  0.3   fF/λ            VDD         5.0   V
  poly over field      0.2   fF/λ²           VBB        -2.0   V
  gate                 1.6   fF/λ²           γ           0.75  V^(1/2)
  gate-src overlap     0.5   fF/λ            k'          20.7  μA/V²
Resistances:                                             600   cm²/V-sec
  polysilicon          50    Ω/□             min. electr. . «  μm
  diffusion            10    Ω/□

TABLE I: NMOS Device Parameters (worst-case speed)

shown to be fastest with no buffering at all. Larger chains, though, may
benefit significantly from buffering; it allows delay dependence on adder size
to be reduced from $N^2$ to $N$ as seen by the reduction in slope on the graph.
The effect of increased parasitics (versus the 1:4 ratio mentioned above) is to
make the overall delay increase much more quickly with chain length, so that
buffering is more attractive for smaller adders. Further analysis is necessary
for optimization in such a case.

The  performance  of  the  carry-select  approach  using  ripple  adders  can  also 
be  evaluated  using  the  ripple  data.  Each  multiplex  operation  requires  a  single 
buffered  stage  and  precharge  logic,  for  which  delays  may  be  obtained  from  Fig¬ 
ure  8. 

Figure 10: Parallel Adder Logic Stages in NMOS

The  parallel  adder  is  implemented  in  alternating  precharged  and  static 
logic  stages  (Figure  10).  The  logic  functions  shown  are  identical  to  those  in  Fig¬ 
ure  4.  Operation  is  again  two  phase:  first,  precharging  of  the  dynamic  gates, 
then a selective discharge, driven by the input terms. Because both polarities of
the intermediate P and G product terms are required, fully dynamic logic chains
are  not  appropriate.  Delay  of  a  series  of  parallel  adder  stages  is  similar  to  that 
of  ripple  carry  stages  which  are  buffered  every  bit,  as  both  consist  of  a  static 
and  a  dynamic  gate.  The  result  of  the  P  and  G  product  term  calculation  for  all 
bits  must  then  be  processed  in  a  final  stage  to  include  the  carry  input.  Such 
parallel  adder  logic  has  been  implemented  in  the  MIPS  and  Xerox  Dragon  32-bit 
microprocessors. 

Evaluation  of  Carry  Schemes 

Some  adder  strategies  for  representative  microprocessors  are  summarized 
in  Table  II.  It  is  not  clear  which  design  is  best  for  a  given  datapath  size,  although 
small  adders  all  are  of  the  ripple  type.  First  a  comparison  using  the  fixed  gate 
delay  model  is  performed,  as  a  starting  point.  This  is  compared  to  results  of 
NMOS  implementation  using  dynamic  logic  and  bootstrapping  techniques  where 
applicable.  Although  speed  will  be  the  main  focus  of  analysis,  implications  on 
chip  area  and  power  will  also  be  discussed. 


INTEL 8080          8 Bit Ripple
MOTOROLA 68000      16 Bit Ripple
CALTECH OM-2        16 Bit Ripple
UCB RISC II         32 Bit Ripple
BELLMAC-32          32 Bit Partial Lookahead
UCB RISC I          32 Bit Partial Lookahead
HP FOCUS            32 Bit Parallel
STANFORD MIPS       32 Bit Parallel
XEROX DRAGON        32 Bit Parallel

TABLE II: Microprocessor ALUs
The fixed gate delay model has been applied to large adders in order to
evaluate performance asymptotically. This is not appropriate for comparing
approaches for microprocessors with adders of only 32 bits or less. This is too
small for an asymptotic analysis because the constant components of delay cannot
be ignored. The data presented will consider absolute delays for typical ALU


Earlier  discussion  of  the  parallel  adder  considered  arbitrary  gate  fan-in.  In 
TTL,  there  is  little  penalty  for  increased  fan-in:  most  of  the  delay  is  attributed 
to  the  output  driver.  In  VLSI,  however,  increased  fan-in  has  its  cost.  For  short 
paths,  the  gate  delay  is  highly  dependent  on  transistor  parasitics  at  the  output 
node.  Increasing  the  number  of  inputs  to  a  NOR  gate  adds  more  drain  diffusion 
and  overlap  parasitics  to  the  output  node.  Because  fixed  device  ratios  must  be 
maintained  for  adequate  noise  margins  in  static  logic,  these  intrinsic  device 
parasitics  cannot  be  eliminated  from  consideration. 


Figure 11: Realizations of 8-Input NOR Function
(a) 8-input NOR gate; (b) equivalent using 2-input gates


Delays of an 8-input, static NOR gate were compared with an equivalent
realization composed of 2-input gates (Figure 11). The results, in Table III, are
based on the device parameters given in Table I. Delay was measured as the
time required for the output to reach 3V and 2V for logic "1" and "0",
respectively. Delay to logic "0" (tdHL) is significant for the 2-input version.
However, the interesting delay is that to logic "1" (tdLH): this is the limiting
delay. Neglecting wire loading, the smaller fan-in incurs a 20% delay penalty.

Static NOR              No Wire Delay       With Wire Delay
Fan-In (K=4)            tdLH      tdHL      tdLH      tdHL

8  (Figure 11(a))       15 ns     3 ns      21 ns     4 ns
2  (Figure 11(b))       18 ns     12 ns     23 ns     16 ns

TABLE III: Gate Fan-In Comparison

With wire delay, penalty for the 2-input constraint reduces to less than 10%.
Wire loading was calculated based on a 42 λ spacing between NOR inputs; this is
the datapath pitch between each bit slice (as determined by the register file) of
the RISC II microprocessor [13]. A minimum-width metal line was assumed to
connect the drains of the multiple NOR input transistors.
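
The two penalty figures quoted above follow directly from the logic "1" delays in Table III. A short check, using only those table values (the names `T_DLH_NS` and `fan_in_penalty` are illustrative):

```python
# Delays to logic "1" (ns) from Table III, keyed by gate fan-in.
T_DLH_NS = {
    "no_wire": {8: 15.0, 2: 18.0},
    "with_wire": {8: 21.0, 2: 23.0},
}


def fan_in_penalty(delays: dict) -> float:
    """Fractional slowdown of the 2-input realization versus the 8-input gate."""
    return delays[2] / delays[8] - 1.0


penalties = {case: fan_in_penalty(d) for case, d in T_DLH_NS.items()}
for case, penalty in penalties.items():
    print(f"{case}: {penalty:.1%}")
```

The added wire capacitance loads both realizations, so the relative cost of restricting fan-in shrinks once wiring is included.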

Since these resulting circuit delays are so similar, we will restrict ourselves
to circuits using a fan-in of 2 in the subsequent comparison of various carry
implementations. This simplifies the analysis by reducing the number of
variables to be considered. Coincidentally, the parallel adders in the MIPS and the
Xerox Dragon microprocessors both use gates with a fan-in of 2 for the carry
computation.

Table IV summarizes the delay and gate count for the fixed delay model
carry schemes. Figure 12(a) depicts adder delay as a function of N, while 12(b)
gives gate count as a first approximation of area and power requirements.
Performance for the more complicated schemes is seen to improve considerably


Carry Scheme    Gate Delays for Carry    # Gates for Carry

Full Ripple     2N                       2N
Carry-Select    4√N - 2                  4N - 2
Parallel        4 log2(N) - 2            8N - 3 log2(N) - 8

TABLE IV: Carry Computation: Fixed Gate Delay Model
(fan-in = 2, neglecting wiring delays)

Figure 12: Delay and Gate Count Comparison

over that for the ripple scheme, though at the cost of additional area and power
requirements. The carry-select design is fastest from 8 to 16 bits; it requires
nearly twice the number of gates as the ripple version. The parallel design is
fastest for 32 bits and beyond, though at a gate and power cost approaching four
times that of the ripple carry scheme.
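
The crossover points stated above can be reproduced by evaluating the Table IV expressions for a few word lengths. The sketch below assumes the expressions read 2N, 4√N - 2 and 4 log2(N) - 2 for delay, and 2N, 4N - 2 and 8N - 3 log2(N) - 8 for gate count; the function names are illustrative.

```python
import math

SCHEMES = ("ripple", "carry-select", "parallel")


def carry_delay(scheme: str, n: int) -> float:
    """Carry delay in gate delays (fixed gate delay model, fan-in of 2)."""
    if scheme == "ripple":
        return 2 * n
    if scheme == "carry-select":
        return 4 * math.sqrt(n) - 2
    if scheme == "parallel":
        return 4 * math.log2(n) - 2
    raise ValueError(f"unknown scheme: {scheme}")


def carry_gates(scheme: str, n: int) -> float:
    """Number of gates used for the carry computation."""
    if scheme == "ripple":
        return 2 * n
    if scheme == "carry-select":
        return 4 * n - 2
    if scheme == "parallel":
        return 8 * n - 3 * math.log2(n) - 8
    raise ValueError(f"unknown scheme: {scheme}")


for n in (8, 16, 32, 64, 128):
    delays = ", ".join(f"{s}={carry_delay(s, n):.1f}" for s in SCHEMES)
    gates = ", ".join(f"{s}={carry_gates(s, n):.0f}" for s in SCHEMES)
    print(f"N={n:3d}  delay: {delays}  gates: {gates}")
```

Under these expressions the carry-select and parallel delays coincide at N = 16, which is where the ordering of the two schemes changes.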


TABLE V: MOS Implementation Comparison
(where ripple chain length M = √N)


Results of the NMOS delay analysis using optimal size buffered ripple carry
chains are summarized in Table V. These results represent optimal solutions;
actual values will differ slightly for the carry-select adder in order to
accommodate the granularity of optimal chain length. Data for typical adder sizes are
given in Figure 13. Results are based upon the optimally buffered, bootstrapped
carry chain length of 8 bits with 15ns of delay, and a single buffer stage delay of
7ns. Using these parameters, the ripple adder is best for 8 bits and also
attractive for 16 bits, due to its reduced area and power requirements. The
carry-select is fastest through 128 bits and requires far fewer buffers than the
parallel design. In fact, for a large adder the parallel approach requires nearly
20 times as many buffers as the carry-select scheme. This high number of
buffers can significantly impact power resources on the chip. Even in a CMOS
implementation using no static power, the additional buffers are costly in terms
of silicon area. These results differ markedly from those obtained using the
fixed gate delay model.

Figure 13: NMOS Carry Logic Comparison

A more accurate comparison must include the effect of wiring delays. The
ripple adder has the shortest paths and would be least affected by such delays.
The parallel adder, in contrast, has wire lengths which increase with N; the
longest connections must span half the width of the ALU. Inclusion of such delays
could only lessen the gains of increased parallelism. This further reduces the
applicability of the fixed gate delay analysis and makes the parallel adder
unattractive for VLSI implementations.
CHAPTER 6:


PROCESSOR  PERFORMANCE 


Previous  chapters  have  discussed  individual  areas  of  design  tradeoffs,  one 
by  one.  In  reality,  there  is  much  more  interaction  among  these  areas  than  has 
been  suggested  so  far.  In  order  to  perform  the  analysis  necessary  to  account 
for  this  interaction,  an  understanding  of  the  entire  processor  design  and  its 
associated  tradeoffs  is  required.  Present  32-bit  microprocessors  include  up  to 
several  hundred  thousand  transistors.  Familiarization  with  every  aspect  of  the 
design  then  can  be  a  monumental  task. 

However, it is not necessary to consider processor behavior all the way
down to the circuit level. Use of higher levels of abstraction allows some
tradeoffs to be evaluated independently of circuit details or of the fabrication
technology. This yields good results without overwhelming the designer and
without burdening him with unnecessary detail. In some other cases, however,
the strengths and weaknesses of a particular implementation technology will
have an impact even at the architectural level. For example, the cost of
implementing a particular function on the chip may vary so much among different
processing technologies that it may become prohibitive in some instances.

Limited chip area and power resources make processor design optimization
a real challenge. A large local memory will reduce the amount of data I/O
required during execution at the cost of chip resources otherwise available for
other functions. Increasing the datapath wordsize has a similar effect. Optimal
local memory capacity, discussed in Chapter 3, may be too costly to implement.
Other strategies for performance improvement, such as increased swap support
or processor pipelining, must then be considered.

Increased pipelining boosts processor throughput as discussed in Chapters
2 and 4. A side effect of pipelining in the register file is higher memory area
cost, due to the increase in the number of wordlines and bitlines necessary to
support the concurrent operations. As a result, less local memory may be
implemented in a given amount of chip area. This reduces the gain offered by
pipelining in the first place. For fixed local memory capacity, the highly
pipelined implementation will have slower register operations. Increased pipelining
can even degrade system performance.

Figure 1: Chip Area Allocation in RISC II


Figure 1 shows area allocation for the RISC II microprocessor [1]. The
register file occupies the majority of the datapath area; it also consumes half of
the overall chip power. In contrast, the ALU occupies little area. Based on the
findings of Chapter 5, it is assumed that a sufficiently fast ALU can be realized to
match the register file speed. Because the register file is a limiting factor in
system performance and datapath bandwidth (Chapters 3 and 4), it remains the
focal point of discussion in this chapter.


System  Timing 

In Chapter 4, several datapath timing schemes were evaluated. The number
of sub-phases in each datapath cycle was reduced by going from a shared
bitline to a dedicated bitline configuration. The delayed write and overlapped
read/write schemes further reduce the cycle time. A four-fold speedup was
predicted, as shown in Table I.


Table  I:  Datapath  Timing  Schemes 


Overall system timing including pipelining is summarized in Table II. For all
except the four-way pipeline, the datapath timing schemes vary the number of
subphases, and thus system performance, by a factor of two. The overall range
of system bandwidth now doubles for an eight-fold variation. Performance of the
sequential and two-way schemes is further affected by the frequency of data
loads and stores; each incurs 50% or 100% overhead per cycle, respectively. For
an instruction mix including 25% data loads and stores, system bandwidth can
then vary ten-fold among the possible timing schemes.


Table  II:  System  Cycle  Time 

Local  Memory  Capacity 

In  Chapter  3,  discussion  focused  on  optimal  local  memory  size.  Chip  area 
was  assumed  sufficient  to  allow  its  implementation.  In  the  real  world,  this  may 
not  be  a  valid  assumption.  The  limited  chip  area  must  be  shared  among  many 
functions.  System  architecture  and  microarchitecture  both  play  important 
parts  in  determining  how  much  room  is  available  for  local  memory.  A  non-ideal 
local  memory  capacity  impairs  performance;  some  improvement  may  be 
achieved  with  increased  swap  support  or  pipelining. 
Pipelining   Bitline         Number of   Number of   Area
Scheme       Configuration   Bitlines    Wordlines   Factor

Sequential   Shared          2           2           4
             Dedicated       3           4           12
Two-Way      Shared          2           2           4
             Dedicated       3           4           12
Three-Way    Shared          2           4           8
             Shared          4           2           8
             Dedicated       4           4           16
Four-Way     Dedicated       4           4           16

Table III: Bit Cell Area Variation

Increased datapath pipelining generally requires a larger bit cell (Chapter
4). An effect of pipelining, then, is a reduction in the amount of local memory
which can be realized in fixed chip area. Table III presents the number of
bitlines and wordlines required for various levels of pipelining. A simple estimate of
relative cell size may be obtained by the product of the number of bitlines
(horizontal) and wordlines (vertical) passing through the cell. In the case of a
pseudostatic bitcell design, the refresh line is added to the wordline count. This
model yields a four-fold variation in bit cell size. The degree of pipelining then
significantly affects the amount of local memory attainable on chip.

The RISC I, with a pseudostatic, dedicated bitline cell, incurs an area factor
of 12. The RISC II utilizes a static, shared bitline and wordline cell; its area
factor is only 4. This factor of three difference in bitcell size predicted by our
simple area model closely matches actual silicon implementation (2,733.5 versus
924 λ² per bit).
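
The area model above is a single multiplication, so the RISC I and RISC II comparison can be checked in a few lines. This is a minimal sketch: the line counts are taken from the dedicated and shared rows of Table III (with any refresh line assumed to be already included in the wordline count), and the function name `area_factor` is illustrative.

```python
def area_factor(bitlines: int, wordlines: int) -> int:
    """Relative bit cell area: bitlines times wordlines crossing the cell."""
    return bitlines * wordlines


risc1_model = area_factor(bitlines=3, wordlines=4)  # dedicated bitline cell
risc2_model = area_factor(bitlines=2, wordlines=2)  # shared bitline cell
model_ratio = risc1_model / risc2_model

# Measured cell sizes quoted in the text, per bit.
measured_ratio = 2733.5 / 924.0

print(f"model ratio:    {model_ratio:.2f}")
print(f"measured ratio: {measured_ratio:.2f}")
```

The model ignores the transistors inside the cell and counts only the lines crossing it, which is why agreement with the measured ratio is approximate rather than exact.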
Local memory capacity is also limited by allowable power dissipation.
Power  for  a  static  register  file  is  determined  by  the  number  of  registers.  For 
each  static  register,  one  inverter  of  the  pair  maintains  a  constant  current  to 
ground.  There  is  little  or  no  additional  power  required  for  increased  pipelining. 
If  the  optimal  local  memory  size  cannot  be  achieved  due  to  power  limitations, 
then  pipelining  may  need  to  be  increased  for  higher  performance. 

Static  bit  cell  power  consumption  (in  NMOS)  may  be  reduced  by  lengthening 
the  depletion  load  transistors.  This  increases  cell  area.  The  long  depletion  loads 
in  the  RISC  II  register  file  lengthened  the  cell  sufficiently  to  admit  four  bitlines 
without  additional  area  penalty;  however,  only  two  were  used  by  the  two-way 
pipelined  datapath  using  dedicated  bitlines  [2]. 

Power  dissipation  may  be  reduced  without  an  area  penalty,  using  high- 
resistance  polysilicon  loads  or  dynamic  storage.  These  strategies  can  increase 
register cycle time (due to longer write and restore delays), as well as the
susceptibility to soft errors induced by alpha particles.

Complementary-MOS (CMOS) is attractive due to its extremely low static
power dissipation, which is determined only by leakage currents. In the past,
CMOS was used primarily in specialized, low-power applications, such as in
digital watches and other battery-operated products. The additional area required
for "wells" or "tubs" needed to accommodate the complementary devices made
CMOS too expensive to compete with the (then simpler) NMOS process. The
resulting emphasis placed on NMOS technology further widened the gap in
performance between these two technologies. At present, however, NMOS chips
have reached their limit in allowable power dissipation. A great deal of attention
is now being focused on CMOS process development; it is emerging as the
primary candidate for exploiting higher device densities.
Datapath Bandwidth

Datapath  bandwidth  has  been  discussed  in  terms  of  phases,  assuming  fixed 
phase  length.  Processor  cycle  times  using  this  assumption  were  presented  in 
Table  II.  In  Chapter  3,  register  cycle  time  was  considered  to  grow  with  the 
square-root  of  local  memory  capacity.  Since  the  register  delay  makes  up  most 
of the cycle time, processor bandwidth decreases with enlarged memory
capacity. System performance was reevaluated to include this effect.

Depending  on  the  bitline  configuration  and  level  of  pipelining  employed,  bit 
cell  area  was  shown  to  vary  (Table  III).  A  larger  cell  requires  longer  bitlines 
and/or  wordlines.  This  leads  to  increased  delay,  which  follows  the  square  root  of 
cell  area  in  a  manner  similar  to  that  discussed  in  Chapter  3. 


Pipelining   Datapath     Bitline         Delay    Relative
Scheme       Sub-Phases   Configuration   Factor   Cycle Time

Sequential   4φ           Shared          2        16
             3φ           Shared          2        12
             2φ           Dedicated       √12      13.9
Two-Way      4φ           Shared          2        8
             3φ           Shared          2        6
             2φ           Dedicated       √12      6.9
Three-Way    4φ           Shared          √8       11.28
             3φ           Shared          √8       8.46
             2φ           Dedicated       4        8
Four-Way     1φ           Dedicated       4        4

Table IV: Processor Cycle Times


The relative cycle time for various pipelining and datapath timing schemes
is given in Table IV. Results are based on the number of sub-phases per cycle
(Table II) times the square root of the bit cell area factor (Table III). Whereas
Table II predicted an eightfold range in datapath bandwidth, the new results
show only half of this is actually achieved. (These results assume a constant
memory capacity; where several bitcell entries yield the same number of
datapath sub-phases, the smaller cell was chosen.)

Both the RISC I and RISC II implementations utilized two levels of system
pipelining. The RISC I, with its large bitcell and 3φ datapath timing scheme, has
a relative cycle time of 10.4 using our delay model. The smaller bitcell of RISC II
allowed a relative cycle time of only 8.0, despite a 4φ datapath cycle.
Comparison of actual datapath bandwidth for the fabricated chips was not possible,
due to design errors in the control logic of RISC I which did not allow the full
datapath bandwidth to be attained.
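
The delay model used for these two figures (sub-phases per cycle times the square root of the bit cell area factor) can be evaluated directly. A minimal sketch follows; the function name is illustrative, and the inputs are the RISC I and RISC II values quoted in the text (3 and 4 sub-phases, area factors of 12 and 4).

```python
import math


def relative_cycle_time(subphases_per_cycle: int, area_factor: int) -> float:
    """Relative cycle time: sub-phase count scaled by sqrt of the cell area factor."""
    return subphases_per_cycle * math.sqrt(area_factor)


risc1_cycle = relative_cycle_time(subphases_per_cycle=3, area_factor=12)
risc2_cycle = relative_cycle_time(subphases_per_cycle=4, area_factor=4)

print(f"RISC I:  {risc1_cycle:.1f}")
print(f"RISC II: {risc2_cycle:.1f}")
```

The example shows how a smaller cell can more than offset an extra sub-phase: RISC II spends four sub-phases per cycle yet ends with the shorter relative cycle time.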

Data  Wordsize 

A four-bit microprocessor may suffice for a microwave oven controller: little
memory is addressed, operations are simple, and high speed is not required. A
large microprocessor will do the job more than adequately, but it will not be
cost-effective. Other applications such as number crunching of massive
amounts of data pertaining to seismic exploration or weather observation
require much more processing power. These scientific calculations using single
and double precision floating point data require 64 and 128 bit wordsizes in
conjunction with high processor bandwidth. Despite this wide range in wordsize,
these applications all have one thing in common: specialization. Processor
wordsize is determined unambiguously by the application.

In contrast, a time-shared, High-Level Language (HLL) programming
environment supports a wide variety of uses. Data wordsize distribution
includes 8-bit ASCII characters through 128-bit extended precision floating point
numbers. In such an application, the choice of processor wordsize introduces
some interesting tradeoffs.

A wide datapath can execute in a single cycle operations which would
otherwise require several cycles. However, the wide datapath requires proportionally
more  area  and  power  resources.  For  a  design  where  the  datapath  dominates 
chip  area,  such  as  the  RISC  II,  this  cost  is  significant.  The  wider  datapath  is  also 
slower.  Local  memory  cycle  time  (Chapter  3)  as  well  as  ALU  delay  (Chapter  5: 
conditional  carry  scheme)  both  increase  with  the  square  root  of  wordsize. 
Depending  on  the  application,  then,  a  wider  datapath  may  or  may  not  offer 
improved  performance. 

Today,  typical,  time-shared  HLL  systems  utilize  32-bit  processors.  This 
allows  up  to  4  Gigabytes  of  memory  to  be  addressed,  which  is  sufficient  for  most 
applications.  Complex  arithmetic  operations,  such  as  multiplication,  may  be 
best  performed  by  a  co-processor  on  the  system  bus.  Such  a  co-processor  may 
even  handle  larger  word  sizes  than  the  CPU  itself. 

Assuming the CPU is required to work with words of 32 bits or less, it is
interesting to observe the effect of reducing the datapath width to 16 bits.
Bandwidth increases by 41% due to the smaller ALU and local memory size.
Overall performance then will be improved if less than 29% of the instructions
require double cycles for 32-bit data.
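
The 41% figure follows from the square-root dependence of cycle time on wordsize noted earlier. The sketch below computes it and then estimates net performance for a given fraction of double-cycle instructions. The helper `relative_performance` and its accounting (exactly one extra cycle per 32-bit operation, nothing else) are illustrative assumptions made here, not the report's exact model.

```python
import math


def bandwidth_gain(old_bits: int, new_bits: int) -> float:
    """Bandwidth ratio when cycle time scales with the square root of wordsize."""
    return math.sqrt(old_bits / new_bits)


def relative_performance(double_cycle_fraction: float, gain: float) -> float:
    """Net speed versus the wide datapath, assuming one extra cycle per wide operation."""
    return gain / (1.0 + double_cycle_fraction)


gain = bandwidth_gain(old_bits=32, new_bits=16)
print(f"bandwidth increase: {gain - 1.0:.0%}")
for fraction in (0.10, 0.29):
    perf = relative_performance(fraction, gain)
    print(f"{fraction:.0%} double-cycle instructions -> relative performance {perf:.2f}")
```

A value above 1.0 means the narrower datapath comes out ahead for that instruction mix under the stated assumption.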

These  multiple  cycles  include  32-bit  data  loads  and  stores  as  well  as  ALU 
and shift operations. Additionally, each program branch (jump, call, return)
may  modify  the  upper  half  of  the  32-bit  address.  This  also  requires  an  additional 
cycle.  Extra  cycles  due  to  normal  program  counter  incrementing  may  be 
neglected,  since  they  are  very  infrequent. 

An additional incentive for reducing wordsize is that more functionality can
be added using the resources made available. Increased local memory capacity,
better swap support, and more specialized ALU functions (such as 2-bit multiply)
can be added to further increase performance.

In practice, the "dead time" between clock phases as well as the clock
delays themselves reduce the 16-bit speed improvement. Also, more registers
will be utilized in the 16-bit processor, since each 32-bit datum requires two
registers. Although this may not justify increased window size since only five or
fewer registers are typically used per procedure [3], swapping overhead will
increase for the partial swap scheme. A more detailed examination of data
lengths for the particular application is necessary in order to evaluate the
impact of reduced wordsize.

Designing  for  Limited  Chip  Resources 

As we have seen, the local memory occupies the largest part of the datapath
area on the chip and it is the most critical component determining processor
bandwidth. In designing for limited area, realizable local memory capacity
is reduced with pipelining. Local memory capacity may also be limited by
maximum die size in one dimension, which sets a limit on the overall length of the
datapath. In Figure 1, the RISC II local memory size was restricted in the
number of registers by the maximum mask pattern size and package cavity.
For that design, the critical chip cost due to pipelining is attributed to the
number of wordlines.


Table V compares area and length costs per unit bandwidth for fixed capacity
local memory. These costs are given in terms of area- and length-delay products
in order to account for processor bandwidth variation; they are obtained
by multiplying the cycle time (Table IV) by the area or length factor (Table III).
The highest performance return for a given amount of area or length is seen to
occur for the two-way pipelined, delayed-write scheme with shared bitlines. This
is similar to the approach used in the RISC II microprocessor [4].
Pipelining   Datapath     Bitline         Area-Delay   Length-Delay
Scheme       Sub-Phases   Configuration   Product      Product

Sequential   4φ           Shared          64           32
             3φ           Shared          48           24
             2φ           Dedicated       166          55.4
Two-Way      4φ           Shared          32           16
             3φ           Shared          24           12
             2φ           Dedicated       83           27.7
Three-Way    4φ           Shared          90           23
             3φ           Shared          68           17
             2φ           Dedicated       128          32
Four-Way     1φ           Dedicated       64           16

Table V: Relative Chip Area and Length Costs per Unit Bandwidth
(costs given as Area- and Length-Delay products to reflect performance)

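The products in Table V are the relative cycle time multiplied by the area factor or by the wordline count. The sketch below recomputes them for one row per pipelining level and picks the cheapest. The `Config` type and helper names are illustrative, and taking the wordline count as the length factor (with the two-wordline cell for the three-way shared case) is an assumption made here.

```python
from typing import NamedTuple


class Config(NamedTuple):
    scheme: str
    cycle_time: float  # relative cycle time (Table IV)
    area_factor: int   # bitlines x wordlines (Table III)
    wordlines: int     # taken here as the length factor (Table III)


CONFIGS = [
    Config("Sequential, 3 sub-phases, shared", 12.0, 4, 2),
    Config("Two-way, 3 sub-phases, shared", 6.0, 4, 2),
    Config("Three-way, 3 sub-phases, shared", 8.46, 8, 2),
    Config("Four-way, 1 sub-phase, dedicated", 4.0, 16, 4),
]


def area_delay(c: Config) -> float:
    """Area cost per unit bandwidth."""
    return c.cycle_time * c.area_factor


def length_delay(c: Config) -> float:
    """Length cost per unit bandwidth."""
    return c.cycle_time * c.wordlines


best_area = min(CONFIGS, key=area_delay)
best_length = min(CONFIGS, key=length_delay)

for c in CONFIGS:
    print(f"{c.scheme:34s} area-delay {area_delay(c):6.1f}  length-delay {length_delay(c):5.1f}")
print("lowest area-delay:  ", best_area.scheme)
print("lowest length-delay:", best_length.scheme)
```

A lower product means more bandwidth per unit of area or length, so the minimum identifies the most efficient use of a constrained register file.
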
Figure 2 presents the "Tower of Hanoi" benchmark data for the fixed-swap
scheme  from  Chapter  3.  Performance  is  compared  among  the  pipelining 
schemes  of  Table  V,  using  the  best  implementation  for  each  level  of  pipelining. 
Relative  performance  is  plotted  as  a  function  of  chip  area,  in  units  of  the  area 
factor  times  the  number  of  local  memory  windows.  Since  one  window  in  RISC  is 
reserved  for  global  data,  entries  begin  at  two  windows. 

In accordance with the area-delay product in Table V, the 2-way pipelined
implementation (#2 in the figure) yields the best performance with little chip
area. It even outperforms the three-way version (#3) over the entire range. The
four-way version (#4) is not better until three or four times the area is available;
maximum performance improvement over the two-way implementation is about
50%.

Figure 3 presents data using the more efficient partial swap method of local
memory management, which was seen to perform the best in Chapter 3. With
such a reduction in swap overhead, performance now degrades noticeably as the
local memory capacity is increased, due to the increase in register cycle time.
Overall performance improves by 50% versus the fixed swap scheme, while
requiring only two windows (less than a third of the capacity required for the
fixed swap scheme) for peak performance. As before, the four-way version
yields 50% better throughput than the two-way, at a cost of three times the
register area. The two-way pipeline remains superior to the three-way pipeline.


Figure 4: Pipelined System Performance Versus Memory Length
(RISC II executing "Tower of Hanoi," fixed swaps, two cycles per register)

Figures  4  and  5  present  similar  results,  this  time  in  terms  of  the  chip  length 
constraint.  Performance  is  given  versus  the  number  of  bit  cell  wordlines  times 
the number of windows. Relative performance of the pipelined schemes is
similar to that in Figures 2 and 3. However, the variation in length cost is not as
dramatic  as  that  for  area:  only  a  factor  of  two  in  length  separates  the  optimal 

As we have seen, processor design optimization in VLSI is a complex task,
which must account for the limited resources available on a chip. The
microarchitect must not only be familiar with the limitations of the integrated circuit
technology available; he must also have some knowledge of the demands of the
programming environment for which the processor is designed. This is truly a
great challenge for the microarchitect.
CHAPTER 7:


CONCLUSIONS 


There are many tradeoffs to be considered in the design of a microprocessor.
Often, these tradeoffs are interrelated and thus increase the difficulty of
the task of the chip designer. In order to simplify understanding of these issues,
this work has first presented individual design areas in which tradeoffs can be
made. Each of these design areas has been discussed individually in order to
clarify the range of choices and their associated costs. Later, overall chip
design was viewed with reference to all of these design tradeoffs combined in
different ways.

In  Chapter  1  the  special  constraints  of  VLSI  single-chip  processors  were 
introduced. The high cost of custom design favors a simple and regular
implementation. The RISC architecture addresses these issues by simplifying the
instruction  set  and  thereby  reducing  the  control  logic  on  the  chip.  This  not  only 
frees  up  valuable  chip  area,  but  also  reduces  design  time  significantly.  The 
notion  of  limited  chip  resources  (area,  pins,  power)  sets  the  context  for  the  rest 
of  the  paper.  Attention  is  focused  on  the  datapath  itself,  since  it  dominates  chip 
area  in  RISC  implementations,  and  its  performance  limits  overall  system  speed. 

System pipelining was investigated in Chapter 2. With careful design of the
datapath, pipelining may produce significant performance gains. As the degree
to which pipelining is exploited is increased, however, data and jump
dependencies make it more difficult to attain further speedup. As a result, a careful
study of program behavior is necessary in order to accurately assess the value
of various levels of pipelining. At some point, limited chip resources are better
utilized to speed up other critical paths in the system rather than to support
added pipelining.

Local memory tradeoffs were discussed in Chapter 3. A fundamental limit
to performance exists due to the memory I/O traffic alone. Data memory traffic
can be significantly reduced through the use of an on-chip local memory
organized in multiple banks. Careful study is necessary in order to determine the
ideal size of this local memory. A large local memory reduces datapath
bandwidth and consumes resources available for other functions; too small a
local memory will result in a processor that is restricted by data I/O. In some
cases, however, more sophisticated hardware support for local memory
management can compensate for this performance loss. Local memory design was a
critical factor in optimizing the performance of the RISC microprocessors.

Datapath  timing  for  register-based  machines  was  examined  in  Chapter  4. 
Several  schemes  were  presented  in  order  to  reduce  the  number  of  required 
clock phases in each datapath cycle. The corresponding increase in
concurrency requires different register bitcell designs. In some cases, additional
circuitry  is  needed  in  order  to  eliminate  data  dependencies  within  the  datapath 
itself. 

Design tradeoffs for the ALU were discussed in Chapter 5. Several adder
schemes were compared: ripple, carry-select, and parallel. An initial analysis
was performed based on the assumption of fixed gate delay, which is applicable
to TTL-based implementations. Next, results of NMOS circuit simulations were
utilized to obtain a more realistic comparison of these schemes for VLSI
implementation. Because dynamic logic and bootstrap techniques are available in
NMOS technology, these results differ significantly from those obtained with the
fixed gate delay model. The NMOS ripple carry performed best at 8 bits, while
the carry-select was optimal through 128 bits. The parallel adder was
determined to be undesirable for VLSI implementation because of its large area and
power consumption. This contrasts with the TTL-based results, where the
parallel adder is most attractive.

In  Chapter  6,  all  of  the  previous  design  areas  were  considered  together  in 
order  to  evaluate  overall  processor  performance  under  the  constraint  of  limited 
chip  resources.  Higher  levels  of  pipelining  were  found  to  be  of  diminishing 
return;  the  bit  cells  needed  to  support  increased  concurrency  reduce  the 
bandwidth  of  the  datapath.  The  two-way  pipelined  system  with  a  register  file 
using  shared  bitlines  and  a  delayed  write  scheme  was  found  to  make  the  most 
efficient  use  of  limited  local  memory  area.  Such  a  design  was  utilized  in  the 
RISC II microprocessor.

Each  system  application  requires  its  own  analysis  for  optimization  of  perfor¬ 
mance.  Additionally,  design  decisions  must  be  continually  reassessed  as  the 
available  chip  resources  and  constraints  change  with  improvements  in  technol¬ 
ogy.  It  is  hoped  that  the  ideas  brought  out  in  this  paper  will  address  the  nature 
of  critical  processor  design  tradeoffs  and  will  prove  useful  to  other  designers 
faced  with  the  task  of  fitting  a  high-performance  processor  onto  a  single  VLSI 
chip. 

ABSTRACT

Performance of the RISC II microprocessor is investigated
as a function of local memory size. A larger local
memory effectively reduces data I/O traffic arising from
procedure calls and returns. Datapath bandwidth, however,
is also reduced due to the increased register cycle
time. A conflict then exists between the desire for maximum
datapath bandwidth and the need to reduce data
I/O overhead.

Since the local memory occupies a significant portion
of the chip, it is important to ensure effective usage
of limited silicon resources. Programming environments
which include many nested procedures benefit
significantly from a multiple-bank local memory scheme.
On the other hand, those with few procedures may suffer
from the increased register cycle time. In addition to
the programming environment, the register-bank swapping
strategy, bank overflow interrupt overhead, and
swap I/O support affect optimal memory size.


Introduction 

Traditionally, decisions in computer design
have been based in two different camps: architecture
and implementation. Architecture concerns
the design as it is reflected in the
instruction set: instruction and data word
sizes, data types, and addressing modes.
Implementation concerns the microarchitecture:
system partitioning, placement, communication,
and timing; and, at a lower level,
circuit design and device technology. Low levels
of integration have permitted a wide semantic
gap to exist between the architect and the
chip designer: the limited repertoire of
SSI/MSI chips offered little input for architectural
influences.

Enter LSI and VLSI, and we suddenly find
ourselves confronted with the prospect of
integrating an entire system on a few chips.
Each chip can now take on architectural personality.
Thus, the architecture-implementation
semantic gap is reduced. What
this means is that more direct interdependencies
exist between the two levels. The microarchitect
must now evaluate architectural decisions
in the context of limited chip resources.

Performance of a given architecture is limited
by memory I/O and datapath cycle times.
From the architect's point of view, performance
is I/O limited, and datapath delay can be
neglected. Conversely, the designer considers
the datapath bandwidth to be the performance
limit, with memory I/O negligible. Thus, two
solutions may exist for the design of a particular
system. The goal of this paper is to examine
these solutions in the context of the local
memory scheme used in RISC II.

Ideal  Performance  of  RISC  II 

The RISC II is the second in a series of 32-bit
NMOS microprocessors developed at U.C.
Berkeley [1]. It runs at 500 ns per cycle, using
an 8 MHz clock rate, and was fully functional in
its first silicon run. The RISC instruction set
consists solely of single register-to-register
operations [2]. This simple and regular implementation
reduces control complexity, chip
area, and design time while supporting pipelined
execution [3]. The simple RISC instruction
set is a better match to highly optimizing
compilers than complex instruction sets [4].
Such optimization can also reduce dependency
overhead inherent in pipelined implementations
[5], allowing more effective use of available
datapath bandwidth.

A  drawback  of  such  an  instruction  set  is  the 
high  I/O  bandwidth  required  for  external 
memory.  In  order  to  reduce  this  penalty,  the 
RISC  microarchitecture  includes  support  for 
subroutine  call  and  return,  one  of  the  most 
time-consuming  operations  in  typical  high-level 
language programs [6]. Register saves and
restores  are  reduced  by  employing  a  multiple- 
bank  register  file,  or  local  memory.  The  banks 
are  organized  as  stack  levels,  so  that  register 
save  or  restore  need  be  performed  only  during 
stack  overflows  or  underflows. 

RISC  II  includes  eight  register  banks,  or 
windows,  one  of  which  is  reserved  for  interrupt 
processing.  At  any  one  time  there  are  ten 
registers  local  to  the  present  procedure  level. 
Additionally,  there  are  six  each  high  and  low 
registers  which  overlap  adjacent  procedure  lev¬ 
els;  these  are  used  primarily  for  parameter 
passing.  Each  window  swap  (for  save  or 
restore)  involves  sixteen  registers:  the  ten 
locals  and  one  set  of  overlaps.  Ten  global 
registers  are  accessible  from  any  procedure 
level,  forming  a  total  of  32  addressable  regis¬ 
ters. 
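
The window organization described above can be sketched as a mapping from a logical register number to a physical register. The following Python sketch is illustrative only: the logical numbering (0-9 global, 10-15 low overlap, 16-25 local, 26-31 high overlap), the circular layout of the banks, and the function name `physical_register` are assumptions and are not taken from the text.

```python
NUM_WINDOWS = 8      # register banks on RISC II (from the text)
NUM_GLOBALS = 10     # registers visible at every procedure level
NUM_LOCALS = 10      # registers private to one procedure level
NUM_OVERLAP = 6      # registers shared with each adjacent level
WINDOW_STRIDE = NUM_LOCALS + NUM_OVERLAP   # 16 new registers per window

# Assumed circular layout: the globals plus 16 registers for every window.
WINDOWED_REGS = NUM_WINDOWS * WINDOW_STRIDE
PHYSICAL_REGS = NUM_GLOBALS + WINDOWED_REGS


def physical_register(window: int, logical: int) -> int:
    """Map logical register 0..31 of `window` to a physical index.

    Hypothetical numbering: 0-9 global, 10-15 low overlap,
    16-25 local, 26-31 high overlap. The high overlap of window w
    is the same storage as the low overlap of window w + 1.
    """
    if not 0 <= logical < 32:
        raise ValueError("logical register must be in 0..31")
    if logical < NUM_GLOBALS:
        return logical                      # globals ignore the window pointer
    offset = (window * WINDOW_STRIDE + (logical - NUM_GLOBALS)) % WINDOWED_REGS
    return NUM_GLOBALS + offset
```

Under these assumptions each window sees 10 + 6 + 10 + 6 = 32 addressable registers, and a parameter written to a high register of one window is read from a low register of the next window without any data movement.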

Table I shows the relative performance of
two C programs as the number of windows on
the chip varies. These results are based on earlier
studies of procedure behavior and register
file management overhead for RISC [7,8]. Both
benchmarks, Tower and Puzzle, nest to a depth
of twenty. However, Tower has significantly
more procedure calls (19% versus 0.7% for Puzzle),
and makes much more intensive use of the
multiple windows. Performance in such a programming
environment improves considerably
through the use of, say, seven windows. Puzzle,
on the other hand, performs well with only two
windows.


WINDOWS                      1      2      3      5      7      9      ∞

TOWER
19% Dynamic Call & Return    7.08   3.02   1.52   1.30   1.10   1.02   1.00

PUZZLE
0.7% Dynamic Call & Return   1.17   1.02   1.00   1.00   1.00   1.00   1.00

TABLE I: Normalized RISC II Execution Time
(relative to case of infinite windows)

Typical  programs  have  a  procedure  call  or 
return  every  twenty  instructions,  so  the  bench¬ 
marks  shown  here  represent  extremes  [9].  In 
consideration  of  limited  chip  resources,  it  may 
be  attractive  to  tailor  the  local  memory  size  to 
specific  environments.  This  allows  resources  to 
be  freed  for  performance  improvement  in  other 
areas. 


Cost  of  Fixed-Size  Window  Swaps 

The cost of register window overflow is
determined by two factors: the overhead of servicing
an interrupt, and the data I/O
bandwidth. The RISC II microprocessor incurs a
penalty of about thirty instructions for the window
overflow/underflow interrupt routine, in
addition to the register swap cost. The single
I/O bus implementation supports one I/O
operation per cycle, meaning that each Load or
Store takes two cycles. Although each window
swap is costly, there is a sufficient number of
windows on chip so that the swaps are few.
Reducing the local memory size degrades performance
significantly due to this high cost of
swapping. With fewer windows, better swap
interrupt and I/O support is important.

At compile time, the dynamic procedure
profile is not known. Therefore, the compiler
cannot anticipate window overflows. For this
reason, overflows must be detected on chip. It
is the cost of handling this interrupt which
accounts for thirty RISC II instructions. The
compiler can, however, anticipate swaps for a
single window implementation: every executed
call or return would require a save or restore,
respectively.

Since each swap utilizes the same protocol
(sixteen adjacent registers swapped to/from
the current window), better data I/O support
can be easily obtained. For example, a single
instruction may provide all the necessary information
for the swap. Then it is not necessary
to fetch individual Load or Store instructions
for each of the registers. Two data words may
be passed on the bus each machine cycle,
instead of one data word every two cycles interleaved
with instruction fetching.
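
The per-swap cost implied by this discussion can be worked out with a small calculation. The sketch below is a rough model, not the authors' simulator: it assumes that each of the roughly thirty interrupt-routine instructions takes one cycle and that the sixteen register transfers proceed at a fixed rate of data words per cycle. The function name `swap_cost_cycles` is hypothetical.

```python
import math
from fractions import Fraction

INTERRUPT_OVERHEAD_CYCLES = 30   # "about thirty instructions", assumed one cycle each
REGISTERS_PER_SWAP = 16          # ten locals plus one set of six overlap registers


def swap_cost_cycles(words_per_cycle, interrupt: bool = True) -> int:
    """Cycles for one window swap at a given data I/O rate.

    words_per_cycle: 1/2 (a Load or Store every two cycles),
    1 (single data I/O per cycle), or 2 (dual data I/O per cycle).
    interrupt: include the overflow/underflow interrupt overhead.
    """
    transfer = math.ceil(Fraction(REGISTERS_PER_SWAP) / Fraction(words_per_cycle))
    return transfer + (INTERRUPT_OVERHEAD_CYCLES if interrupt else 0)


if __name__ == "__main__":
    for rate in (Fraction(1, 2), 1, 2):
        print(rate, swap_cost_cycles(rate))
```

In this model the transfer time falls from 32 to 16 to 8 cycles as I/O support improves, while the fixed interrupt overhead stays at thirty cycles and soon dominates the total.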

Table II presents RISC II performance as a
function of data I/O support and local memory
size. The cost of each swap includes the thirty
cycles for interrupt overhead, as well as the
sixteen data word transfers. Since swap overhead
for Puzzle is small, only Tower is considered
here. As before, seven or more windows
are desirable for high performance, regardless
of the level of data I/O support. This is because
the interrupt overhead is so high. An exception
is the single window case, which is seen to provide
best performance for implementations
with fewer than five windows.

WINDOWS                        1      2      3      5      7      9      ∞

One-Half Data I/O Per Cycle    7.06   4.91   3.95          1.19   1.05   1.00

Single Data I/O Per Cycle      4.04   3.90   3.19   1.55   1.14   1.03   1.00

Dual Data I/O Per Cycle        2.52   3.39   2.81   1.46   1.12   1.03   1.00

TABLE II: Tower Execution Time with Varying Swap Support
(includes interrupt overhead for multiple window cases)

Improved  Swapping  Strategies 

Thus far, we have discussed fixed-size swap
overhead of RISC II. This scheme is attractive
due to its simplicity, as well as its amenability
to better I/O support. However, such a scheme
swaps all registers in the window, whether they
were used or not. A study of several C programs
has determined that a mean of four
registers are used per procedure in RISC [7].
Therefore, the fixed-swap scheme, which moves
sixteen registers per swap, performs four times
the save and restore data I/O actually necessary.

In  order  to  keep  track  of  which  registers 
have  been  used,  a  "dirty  bit"  scheme  may  be 
implemented.  During  each  register  write,  a  bit 
is  set  to  indicate  a  potential  swap  candidate. 
Swaps  would  vary  in  length,  depending  on  the 
number  of  bits  set.  The  increased  hardware 
complexity  necessary  to  support  such  an 
approach,  however,  is  undesirable. 
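
The dirty-bit idea can be sketched in software as follows. This is an illustration only, not a description of any hardware in RISC II; the class name `DirtyWindow` and its methods are hypothetical, and a sixteen-register window is assumed to match the fixed-size swap described earlier.

```python
class DirtyWindow:
    """One register window with a dirty bit per register (illustrative)."""

    def __init__(self, size: int = 16):
        self.values = [0] * size
        self.dirty = [False] * size

    def write(self, index: int, value: int) -> None:
        """Write a register and mark it as a potential swap candidate."""
        self.values[index] = value
        self.dirty[index] = True

    def save(self) -> list:
        """Return the (index, value) pairs that must go to memory.

        Only registers written since the last save are included, so the
        length of the swap varies with the number of bits set.
        """
        saved = [(i, v) for i, (v, d) in enumerate(zip(self.values, self.dirty)) if d]
        self.dirty = [False] * len(self.dirty)
        return saved
```

A procedure that writes four registers would then save four words rather than sixteen, which is the factor-of-four reduction noted above.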

A single window implementation does not
require this hardware. The compiler can insert
code before each call which saves utilized registers.
Restoring registers after a return can be
done on demand. Since not all registers may
need to be restored, further reduction in I/O is
anticipated. Save overhead may be eliminated
by performing a data memory write in parallel
with all register file writes. This requires a
dual-bus microarchitecture which can perform
an instruction and data access each cycle, such
as the MIPS microprocessor [10]. Overall swap
overhead may then be reduced by more than a factor
of eight.

Although we have described alternative
strategies for local memory management, we
have not yet addressed effective use of on-chip
memory area. These alternatives reduce off-chip
register save space by a factor of four, but
on-chip memory still attains only 25% utilization.
A variable size window scheme addresses
this issue by allowing dynamic allocation of window
size. Several register banks can be implemented
for an efficient replacement strategy, and
procedures may span bank boundaries. This
scheme requires additional hardware, in the
form of pointers for each procedure domain,
and an adder to calculate the physical
addresses of the registers.
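
The pointer-plus-adder addressing of a variable size window scheme might look like the following sketch. It is a simplified model under stated assumptions: the register file is treated as one circular array, each call allocates its window immediately after the caller's, and overflow handling (spilling older domains when the file is full) is deliberately omitted. The class `VariableWindowFile` and its method names are hypothetical.

```python
class VariableWindowFile:
    """Variable size register windows addressed by base pointer + offset."""

    def __init__(self, total_registers: int = 64):
        self.total = total_registers
        self.domains = []   # (base pointer, size) for each active procedure

    def call(self, size: int) -> None:
        """Allocate `size` registers just past the caller's domain."""
        if self.domains:
            base, prev_size = self.domains[-1]
            new_base = (base + prev_size) % self.total
        else:
            new_base = 0
        self.domains.append((new_base, size))

    def ret(self) -> None:
        """Release the current domain and return to the caller's."""
        self.domains.pop()

    def physical(self, logical: int) -> int:
        """The 'adder': physical address = base pointer + logical number."""
        base, size = self.domains[-1]
        if not 0 <= logical < size:
            raise IndexError("register outside the current window")
        return (base + logical) % self.total
```

Because each procedure claims only the registers it needs, a file of a given size holds more procedure levels than fixed sixteen-register windows would, at the price of the extra pointer state and the addition on every register access.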

Performance comparison of these schemes
is presented in Table III. All cases assume
single data I/O per cycle support and four
registers per procedure. Interrupt overhead is
included in all multiple window, as well as the
variable size, implementations. With few windows,
significant performance improvement is
observed. Using more efficient swapping strategies,
high performance may be attained with
fewer windows.


WINDOWS                              1      2      3      5      7      9      ∞

Full-Bank Register Swaps             4.04   3.90   3.19   1.55   1.14   1.03   1.00

Partial Register Swaps               1.76   2.16   1.62   1.41   1.10   1.02   1.00

Partial Swaps with Write Save        1.38   1.50   1.31   1.21   1.05   1.01   1.00

Variable Size with Half Bank Swaps   3.91   1.60   1.25   -      -      -      1.00

TABLE III: Tower Execution Time for Various Swap Schemes
(one data I/O per cycle assumed for all cases)


Register  File  Delay 
Up  to  this  point,  only  I/O  limited  perfor¬ 
mance  has  been  discussed.  From  the 
designer’s  point  of  view,  attention  should  also 
be  focused  on  datapath  bandwidth.  Especially 
for  RISCs,  where  each  execution  cycle  consists 
of  a  standard  register-to-register  operation, 
the  datapath  cycle  time  determines  maximum 
system  performance. 

The machine cycle of the RISC II consists of
a dual-port register read, followed by an ALU or
shift operation; these latter operations overlap
register write and bitline precharge of the previous
instruction. The machine cycle is then
limited directly by the register file read-write-precharge
cycle time.

Increased local memory size is accompanied
by proportionally greater bitline loading.
To reduce the bitline discharge delay, the
wordline transistors may be widened at some
cost in addressing delay. Optimal wordline
transistor size yields a register file cycle delay
which goes as the square root of memory size
[11]. Datapath bandwidth can then be traded
off for reduced I/O with a larger local memory.
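
The square-root dependence can be turned into a quick estimate of relative cycle time. The sketch below is only a back-of-the-envelope model: the register counts per configuration (32 registers for a single window, otherwise 10 globals plus 16 per window in a circular arrangement) are assumptions made for illustration, as are the function names.

```python
import math


def register_count(windows: int) -> int:
    """Hypothetical number of registers for a file with `windows` banks.

    Assumption: 10 globals plus 16 registers per bank arranged
    circularly; a lone bank still needs all 32 addressable registers.
    """
    if windows < 1:
        raise ValueError("at least one window is required")
    return 32 if windows == 1 else 10 + 16 * windows


def relative_cycle_time(windows: int, baseline_windows: int = 1) -> float:
    """Register cycle time relative to the baseline, with delay ~ sqrt(size)."""
    return math.sqrt(register_count(windows) / register_count(baseline_windows))


if __name__ == "__main__":
    for w in (1, 2, 3, 5, 7):
        print(w, round(relative_cycle_time(w), 2))
```

Under these assumptions a seven-window file has a register cycle about 1.95 times longer than a single-window file, so a program with negligible swap overhead would run roughly half as fast.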

Table IV includes the effect of variable
register cycle delay. The partial register swap
schemes using a single window yield the best
performance. A single window, fixed-swap
implementation with dual data I/O per cycle
approaches the performance of the seven window
RISC II with half data I/O per cycle. Execution
time for Puzzle, with little swap overhead,
follows the register cycle time dependence with
memory size; it executes nearly twice as fast
with one window as it does with seven.


TABLE IV: Datapath Bandwidth Limited Execution Time
(smaller  local  memory  is  faster;  using  Tower  benchmark) 


Conclusions 

Chip design tradeoffs in VLSI must be made
using both architectural and circuit design considerations.
One measure of local memory
cost, delay, yields two minimum execution time
scenarios: I/O limited and datapath bandwidth
limited. Because of the limited number of pads
which can be placed on a chip, memory I/O is a
severe bottleneck in system performance. For
this reason, a large local memory was chosen
for RISC II. Presently, memory speed is increasing,
making datapath bandwidth a more critical
determinant of system performance. In the
future, increased chip resources will support
greater local memory hierarchy [12]; the I/O
bottleneck may then be replaced by datapath
limited performance.

Abstract — The RISC (reduced instruction set computer) architecture
attempts to achieve high performance without resorting to
complex  instructions  and  irregular  pipelining  schemes.  One  of  the  novel 
features  of  this  architecture  is  a  large  register  file  which  is  used  to 
minimize  the  overhead  involved  in  procedure  calls  and  returns.  This 
paper  investigates  several  strategies  for  managing  this  register  file. 
The  costs  of  practical  strategies  are  compared  with  a  lower  bound  on 
this  management  overhead,  obtained  from  a  theoretical  optimal 
strategy,  for  several  register  file  sizes. 

While  the  results  concern  specifically  the  RISC  processor  recently 
built  at  U.C.  Berkeley,  they  are  generally  applicable  to  other  processors 
with  multiple  register  banks. 

Index  Terms — Cache  fetch  strategies,  computer  architecture, 
procedure  calls,  register  file  management,  RISC,  VLSI  processor. 

I.  Introduction 

INVESTIGATIONS  of  the  use  of  high-level  languages 
show  that  procedure  call/return  is  the  most  time-con¬ 
suming  operation  in  typical  high-level  language  programs  [8], 
[9]  due  to  the  related  overhead  of  passing  parameters  and 
saving  and  restoring  of  registers.  The  RISC  architecture  [8], 
[9]  includes  a  novel  scheme  that  results  in  highly  efficient 
execution  of  this  operation. 

In conventional, register-oriented computers, the procedure
call/return mechanism is based on a LIFO stack of variable
size invocation frames (activation records). When a procedure
is called, an area on top of the stack is used for storing the input
arguments, saving the return address and register values, allocating
local variables and temporaries, and, if the procedure
calls another procedure, storing output arguments. A procedure's
invocation frame denotes this area on the stack. At any
point in time, the number of invocation frames in the stack is
the current nesting depth. The invocation frame of the calling
procedure overlaps that of the called procedure so that the
memory locations containing the parameters passed from the
calling procedure to the called procedure are part of both
frames.

In  most  computers,  register/register  operations  can  be 
performed  faster  than  the  corresponding  memory/memory 
operations.  Therefore,  the  most  heavily  used  local  variables 
and  temporaries  are  placed  in  registers.  When  a  procedure  is 
called,  it  must  save  the  value  of  all  the  registers  it  will  use  and 
restore  these  values  before  returning  control  to  the  calling 
procedure. Analysis of the dynamic behavior of Pascal and C
programs, executing on a VAX 11/780, has shown [8], [9] that
saving and restoring register values and writing and reading
of parameters from the common area of the caller and the
callee are responsible for more than 40 percent of the data
memory  references. 

In RISC, the call/return mechanism is based on two LIFO
stacks. One of the stacks (henceforth "STACK1") contains
fixed size frames which hold scalar quantities of the invocation
frame (i.e., scalar input arguments, the return address, scalar
output parameters, and scalar local variables and temporaries).
The second stack (henceforth "STACK2") contains variable
size frames, some of which may be empty (i.e., their size is
zero). This stack is used for all nonscalar variables which are
normally placed on the single stack in conventional computers.
It is also used for scalars in case there is not enough space in
the fixed size frame on STACK1.

The size of the STACK1 frame in RISC was determined
based on a study by Halbert and Kessler [5]. The dynamic
behavior of nine noninteractive UNIX™ C programs was analyzed.
These programs included the main part of the C
compiler ccom, the Pascal interpreter pi, the UNIX copy
command cp, the troff text formatter, and the UNIX sort
program. This study showed that a fixed frame size of 22
"words" (22 registers), with an overlap of six "words" between
adjacent frames, is sufficient for all the scalar variables and
arguments in over 95 percent of the procedure calls.

The implementation of STACK2 in RISC is identical to the
implementation of the single LIFO stack in conventional
computers: the stack itself resides in memory, there is a processor
register that serves as a stack pointer, and there is another
register that serves as the frame pointer [4]. There is no
special hardware support for operations on STACK2 but, due
to STACK1, such operations are far less frequent than operations
on the LIFO stack of conventional computers. Since the
implementation and operation of STACK2 are identical to those
of the stack in conventional computers, STACK2 will not be
discussed any further in this paper.

In conventional computers, registers are used for storing part
of the invocation frame of the currently executing procedure
(i.e., the top frame on the stack). In RISC, there is a large
register file that is divided into several fixed size "register
banks," each of which can hold one STACK1 frame. Since
each STACK1 frame partially overlaps the previous STACK1
frame and the next STACK1 frame, each register bank shares
some of its registers with the two neighboring register
banks.

The STACK1 frame used by the currently executing procedure
is always in one of the register banks. At each point in
time, the contents of one of the register banks are addressable
as registers, thus providing a "window" into the register file.
This register bank is always the one containing the STACK1
frame of the currently executing procedure. A procedure call
modifies a hardware pointer and "moves" the window to the
next register bank in the register file, where the STACK1
frame of the called procedure resides. Thus, for example,
register 15 (R15) in the calling procedure is in a different
physical position in the register file from R15 in the called
procedure, although the operand specifier for R15 is identical
in the two procedures.

A return instruction restores the previous value of the above
mentioned hardware pointer so the previous values of all the
registers are "restored" without any data movement. Furthermore,
no memory references are required for passing
arguments since they are passed in registers which are in the
region of overlap between the register banks containing the
STACK1 frames of the caller and the callee.

By  using  this  scheme,  a  procedure  call  in  RISC  can  be  made 
as  fast  as  a  jump  and  with  fewer  accesses  to  data  memory  than 
are required in conventional computers.

Since the size of the register file is limited, there is a need
for a mechanism which will handle the case when the procedure
nesting depth exceeds the number of STACK1 frames
which fit in the register file. When a procedure call is executed,
a new "empty" register bank is needed. If all the register banks
in the register file are in use, an "overflow" occurs. This overflow
causes a trap which is handled by operating system
software. The operating system must free one or more register
banks to make room for the new frame. Since the STACK1
frames in the register banks which are "freed" must be preserved,
the software copies the frames to a conventional LIFO
stack which is kept in memory and contains only STACK1
frames.

When  a  return  instruction  is  executed,  the  window  must  be 
moved to a register bank containing the previous frame (i.e.,
the frame of the calling procedure). If all the register banks
are  free  (i.e.,  the  calling  frame  is  not  resident),  an  "underflow" 
occurs.  This  underflow  causes  a  trap,  upon  which  the  operating 
system  software  loads  one  or  more  frames  from  memory  where 
they  were  stored  when  an  overflow  occurred. 

The register file is simply a write-back cache of STACK1.
The cache blocks are the STACK1 frames. The top few frames
of STACK1 are in the register file while the rest are in memory.
When an underflow occurs, one or more occupied
STACK1 frames are fetched from memory. When an overflow
occurs, one or more register banks are "freed." This can be
interpreted as "fetching" empty STACK1 frames from
memory. Since in both cases the "fetching" is done by software,
there is great flexibility in defining the cache fetch strategy
(algorithm) [10]. This strategy determines the number of
frames to be moved to/from memory when an overflow/underflow
occurs.
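
To make the notion of a fetch strategy concrete, the following sketch simulates one simple policy: move a fixed number of frames, k, on every overflow or underflow trap. This fixed-k policy and the function name `simulate_fixed_strategy` are assumptions chosen for illustration; they are not claimed to be one of the strategies evaluated in the paper.

```python
def simulate_fixed_strategy(trace, w: int, k: int = 1):
    """Count traps and frames moved for a call/return trace.

    trace: iterable of +1 (call) and -1 (return) events.
    w:     number of STACK1 frames the register file can hold.
    k:     frames moved to/from memory per trap (1 <= k <= w).
    Returns (number of traps, total frames moved).
    """
    if not 1 <= k <= w:
        raise ValueError("k must satisfy 1 <= k <= w")
    depth = 1      # current nesting depth
    low = 1        # lowest numbered frame resident in the register file
    traps = 0
    moved = 0
    for step in trace:
        depth += step
        if depth >= low + w:            # overflow: no free bank for the new frame
            low += k                    # copy the k lowest frames to memory
            traps += 1
            moved += k
        elif depth < low:               # underflow: caller's frame not resident
            shift = min(k, low - 1)     # cannot load frames below frame 1
            low -= shift
            traps += 1
            moved += shift
    return traps, moved
```

For a trace of five calls followed by five returns with w = 2, moving one frame per trap causes eight traps and eight frame moves, while moving two frames per trap halves the traps to four with the same eight frame moves. The tradeoff between trap count and memory traffic is what a fetch strategy must balance.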

In this paper, several fetch strategies are considered. A
theoretical "optimal strategy" is developed and is used as a
reference point for evaluating the performance of several
practical strategies. In addition, the effect of register file size
on the performance of different strategies is investigated.

II.  The Optimal Strategy

In this section an optimal strategy for managing the register
file will be discussed. This strategy requires unbounded lookahead
(possibly to the end of the call/return trace) and is
therefore only useful as a lower bound on the cost of practical
strategies. A proof that the proposed strategy is, in fact, "optimal"
is presented.

A.  Definitions 

In order to facilitate further discussion, some formal definitions
are required.

When  a  program  is  executing,  its  nesting  depth  constantly 
changes:  every  procedure  call  increases  the  nesting  depth  by 
one  and  every  return  decreases  the  nesting  depth  by  one. 
Hence,  for  every  execution  of  a  program,  there  is  a  corre¬ 
sponding  sequence  of  nesting  depths.  This  sequence  will  be 
called  a  procedure  nesting  depth  sequence  (PNDS). 

Definition 1: A procedure nesting depth sequence (PNDS)
is a sequence of integers $D = (d_1, d_2, \ldots, d_n)$ where $d_1 = 1$,
$d_i > 0$ for $1 \le i \le n$, and $|d_i - d_{i-1}| = 1$ for $2 \le i \le n$.

The integer $i$ is an index into the PNDS; $d_1$ is the nesting
depth at the beginning of the program. For each $i$, $2 \le i \le n$,
$d_i$ is the nesting depth after $i - 1$ calls and returns are executed
(i.e., after $i - 1$ changes in the nesting depth). Henceforth, an
index into the PNDS will be called a location. An example of
a PNDS is shown in Fig. 1.
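
As a concrete illustration of the PNDS definition, the sketch below builds a PNDS from a trace of calls and returns and checks the three conditions of the definition. The encoding of a trace as +1 for a call and -1 for a return, and the function names, are assumptions introduced here.

```python
def pnds_from_trace(trace):
    """Build the PNDS (d_1, ..., d_n) from a call/return trace.

    trace: iterable of +1 (call) and -1 (return); d_1 is always 1.
    """
    depths = [1]
    for step in trace:
        depths.append(depths[-1] + step)
    return depths


def is_valid_pnds(d) -> bool:
    """Check d_1 = 1, d_i > 0 for all i, and |d_i - d_(i-1)| = 1 for i >= 2."""
    return (
        len(d) >= 1
        and d[0] == 1
        and all(x > 0 for x in d)
        and all(abs(b - a) == 1 for a, b in zip(d, d[1:]))
    )
```

For example, two calls, a return, and another call give the PNDS (1, 2, 3, 2, 3), in which location 4 holds the nesting depth after three changes.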

The frames of STACK1 are numbered from 1 to $m$ (with
$m$ being the current nesting depth, i.e., the number of the frame
of the currently executing procedure). The top (i.e., highest
numbered) few frames of the stack are always in the register
file while the rest are in memory.

Definition  2:  The  register  file  position  (RFP)  is  the  number 
of  the  lowest  numbered  frame  which  is  in  the  register  file. 

When an overflow occurs, the lowest numbered frame(s) in
the register file are copied to memory and the register banks
they occupy in the register file are "freed." This increases the
register  file  position.  Similarly,  when  an  underflow  occurs  the 
RFP  is  decreased.  Thus,  the  number  of  times  the  RFP  is 
changed  during  the  execution  of  the  program  is  equal  to  the 
sum  of  the  number  of  overflows  and  the  number  of  underflows 
which  occur. 

Definition 3: A register file move (RFM) denotes an increase
or decrease in the register file position.

Definition  4:  The  size  of  the  register  file  move  is  the  absolute 
value  of  the  difference  between  the  RFP  before  the  move  and 
the  RFP  after  the  move. 

If the current nesting depth is $d$, the STACK1 frame being
used by the currently executing procedure is the one labeled
$d$. The register file position must be such that this frame is
contained in the register file. Hence, if the register file can hold
$w$ frames and if the RFP is $p$, then $p \le d < p + w$. Before execution
begins, the RFP is some positive integer $p_0$. During the
execution of a program with a PNDS $D = (d_1, d_2, \ldots, d_n)$, for
each nesting depth $d_i$, the corresponding RFP $p_i$ must be such
that the above condition is satisfied, i.e., $p_i \le d_i < p_i + w$.

Fig. 1. PNDS and optimal RFP's.


Definition 5: Given a PNDS $D = (d_1, d_2, \ldots, d_n)$ and a
register file that can hold $w$ frames, a valid register file position
sequence (RFPS) is a sequence of RFP's, $P = (p_0, p_1, p_2, \ldots, p_n)$,
such that the $p_i$'s are positive integers and for all $i$,
$1 \le i \le n$, $p_i \le d_i < p_i + w$.

There is a one-to-one correspondence between nesting
depths in the PNDS and RFP's in the RFPS. Successive
RFP's, $p_{j-1}$ and $p_j$, in the RFPS may be unequal or equal
depending on whether the register file position is modified
between the $j - 2$ and $j - 1$ change in the nesting depth.
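
The validity condition on a register file position sequence can be checked mechanically. The sketch below assumes the PNDS is stored as a list `d` holding d_1 through d_n and the RFPS as a list `p` holding p_0 through p_n, so that `p[i]` pairs with `d[i - 1]`; this storage convention and the function name are choices made here for illustration.

```python
def is_valid_rfps(d, p, w: int) -> bool:
    """Check that P is a valid RFPS for PNDS D and a file of w frames.

    d: [d_1, ..., d_n]            (nesting depths)
    p: [p_0, p_1, ..., p_n]       (register file positions)
    Valid means every p_i is a positive integer and
    p_i <= d_i < p_i + w for 1 <= i <= n.
    """
    if len(p) != len(d) + 1:
        return False
    if any(x <= 0 for x in p):
        return False
    return all(p_i <= d_i < p_i + w for d_i, p_i in zip(d, p[1:]))
```

With D = (1, 2, 3, 2, 3) and w = 2, the sequence P = (1, 1, 1, 2, 2, 2) is valid, whereas keeping the RFP at 1 throughout is not, because frame 3 would fall outside a two-frame file positioned at frame 1.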

Definition 6: If $P = (p_0, p_1, p_2, \ldots, p_n)$ is an RFPS, an
RFM is said to occur in location $j$ ($1 \le j \le n$) of $P$ if and only
if $p_j \ne p_{j-1}$.

The number of RFM's which occur during some interval in
which the program is executing is of interest in this paper. The
interval is defined as a subsequence of the RFPS (which corresponds
to a subsequence of the PNDS).

Definition 7: If $P = (p_0, p_1, p_2, \ldots, p_n)$ is an RFPS, the
number of RFM's occurring in location range $[i, j]$ of $P$, where
$1 \le i \le j \le n$, is the number of unique integers $k$ such that
$i \le k \le j$ and $p_k \ne p_{k-1}$. This number will be denoted by
$\mathrm{RFM}_P[i, j]$.

Definition 8: If $P = (p_0, p_1, p_2, \ldots, p_n)$ is an RFPS, the
memory traffic occurring in location range $[i, j]$ of $P$, where
$1 \le i \le j \le n$, is the total number of STACK1 frames moved
to/from memory as the RFP is set to $p_i, p_{i+1}, \ldots, p_j$
successively. This number is denoted by $\mathrm{MT}_P[i, j]$:

$$\mathrm{MT}_P[i, j] = \sum_{k=i}^{j} |p_k - p_{k-1}|$$
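
Definitions 7 and 8 translate directly into two short loops. The C sketch below is only an illustration; the names `rfm_count` and `memory_traffic` are hypothetical, and the RFPS is assumed to be stored in an array `p` indexed from 0 to n as in the text.

```c
#include <assert.h>
#include <stdlib.h>

/* Number of RFM's in location range [i, j] of the RFPS p (Definition 7):
 * the number of locations k in the range where p[k] != p[k-1]. */
int rfm_count(const int *p, int i, int j)
{
    assert(1 <= i && i <= j);
    int count = 0;
    for (int k = i; k <= j; k++)
        if (p[k] != p[k - 1])
            count++;
    return count;
}

/* Memory traffic in location range [i, j] (Definition 8): the sum of
 * the sizes |p[k] - p[k-1]| of the register file moves in the range. */
long memory_traffic(const int *p, int i, int j)
{
    assert(1 <= i && i <= j);
    long frames = 0;
    for (int k = i; k <= j; k++)
        frames += labs((long)p[k] - p[k - 1]);
    return frames;
}
```

For the RFPS (1, 1, 1, 1, 2, 2, 4, 1), three RFM's occur in range [1, 7] and they move 1 + 2 + 3 = 6 frames.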

B. What is an "Optimal Strategy"?

There is some overhead involved in handling overflow/underflow
traps: saving the current state, determining the cause of the trap,
activating the proper trap handling routine, restoring state, and
returning to normal execution. Hence, it is desirable to minimize the
number of register file overflows and underflows. In addition, there
is the direct cost involved in actually moving the data to/from
memory. For each register file move, this cost is proportional to the
number of frames moved (i.e., to the size of the register file move).
Hence, it is desirable to minimize the number of frames moved for each
overflow/underflow, i.e., the memory traffic which is the result of
overflows and underflows.

The problem of finding the "best" RFPS is similar to finding optimal
strategies for handling page faults in virtual memory systems. In
virtual memory systems it is also desirable to minimize both the
number of page faults (since there is overhead involved in handling
such faults) and the I/O involved in moving memory pages to/from disk
or drum. For the virtual memory problem, Belady [1] developed an
"optimal" page replacement algorithm which causes the fewest possible
page faults for a program which executes in a fixed number of main
memory page frames. Belady's algorithm is not realizable since it
requires knowledge of the future portion of the page trace.

In the next sections it is shown that if the entire call/return
trace of the program (i.e., the PNDS) is known, there exists an RFPS
which achieves both the minimum number of overflow/underflow traps and
the minimum memory traffic resulting from register file moves. It is
further shown that knowledge of the entire PNDS is necessary for
achieving an optimum RFPS.

C.  The  Existence  of  an  Optimal  RFPS 

In order to prove the existence of an optimal RFPS, an algorithm for
deriving such an RFPS from a given PNDS is presented. The optimality
of the RFPS produced by the algorithm is shown by proving that no other
valid RFPS can have fewer register file moves or result in less memory
traffic.

An optimal RFPS can be obtained as follows. We start with the RFP at
1 and keep it there until the nesting depth exceeds the number ($w$)
of register banks in the register file. Now the RFP must be changed,
i.e., an RFM must occur. In order to determine the optimal size of the
RFM, we must look ahead in the call/return trace (i.e., in the PNDS).
Starting from the current location, we determine the longest
subsequence of the PNDS for which a constant RFP is possible (i.e., in
which the difference between the maximum nesting depth and the minimum
nesting depth does not exceed $w - 1$). The new RFP is chosen so that
it is valid for this entire subsequence. From the end of this
subsequence we repeat the procedure until the entire PNDS is covered.

Special handling is required when determining the RFP for the last
subsequence in the PNDS. In this case the difference between the
maximum and minimum nesting depth within the subsequence may be less
than $w - 1$. Hence, there is some freedom in setting the RFP. In
order to minimize the memory traffic, the new RFP is chosen so that it
is valid for the entire subsequence and the absolute value of the
difference between the new RFP and the previous RFP is minimized. An
example of a PNDS and the corresponding optimal RFPS is shown in
Fig. 1.

More formally, the procedure can be stated as follows. Given an
arbitrary PNDS $D = (d_1, d_2, \ldots, d_n)$, an optimal RFPS
$P = (p_0, p_1, p_2, \ldots, p_n)$, for a register file that can hold
$w$ frames, can be obtained as follows:

[1] let $i = 1$, $p_0 = 1$
[2] repeat
[3]     let $E = (d_i, d_{i+1}, \ldots, d_m)$
        where $m$ is the maximum integer such that
        $i \le m \le n$ and $\max(E) - \min(E) < w$
[4]     if ($m < n$) or ($p_{i-1} \ge \min(E)$) then
[5]         for $j = i$ to $m$:
                let $p_j = \min(E)$
[6]     else
[7]         for $j = i$ to $m$:
                let $p_j = \max(E) - w + 1$
[8]     let $i = m + 1$
[9] until $i > n$
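
The steps above can be sketched in C as follows. This is an illustrative transcription, not the authors' code: the name `optimal_rfps` and the array layout (`d[0]` unused) are assumptions, and the comparison in step 4 is read as $p_{i-1} \ge \min(E)$, the reading under which a program that never overflows keeps its RFP at 1.

```c
#include <assert.h>

/* Fills p[0..n] with the RFPS produced by steps [1]-[9] for the PNDS
 * d[1..n] (d[0] is unused) and a register file holding w frames.
 * Returns K, the number of iterations of the repeat loop. */
int optimal_rfps(const int *d, int *p, int n, int w)
{
    assert(n >= 1 && w >= 1);
    int i = 1, K = 0;
    p[0] = 1;                                    /* step 1 */
    do {                                         /* step 2 */
        /* Step 3: grow E = (d[i..m]) while max(E) - min(E) < w.
         * The range never shrinks as m grows, so stopping at the
         * first failure yields the maximum m. */
        int lo = d[i], hi = d[i], m = i;
        while (m < n) {
            int nlo = d[m + 1] < lo ? d[m + 1] : lo;
            int nhi = d[m + 1] > hi ? d[m + 1] : hi;
            if (nhi - nlo >= w)
                break;
            lo = nlo;
            hi = nhi;
            m++;
        }
        /* Steps 4-7: min(E) unless this is the last subsequence and
         * the previous RFP lies below it (assumed reading of step 4). */
        int rfp = (m < n || p[i - 1] >= lo) ? lo : hi - w + 1;
        for (int j = i; j <= m; j++)
            p[j] = rfp;
        i = m + 1;                               /* step 8 */
        K++;
    } while (i <= n);                            /* step 9 */
    return K;
}
```

For the depths (1, 2, 3, 4, 5, 4, 3, 2, 1) and w = 3, the sketch yields three subsequences with RFP's 1, 3, and 1. For (1, 2, 3, 4) the last subsequence takes the else branch and the RFP moves from 1 to 2, so only one frame is written to memory.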

First, it must be shown that the algorithm generates a valid RFPS
for the given PNDS. Proof of the validity of the algorithm and of the
generated RFPS requires proving the following lemmas.

Lemma 1: The repeat and for loops terminate after a finite number of
iterations, i.e., the algorithm always terminates.

Proof: Since $n$ is finite and $i$ is incremented by at least 1
during each iteration of the repeat loop, $n$ is an upper bound on the
number of iterations through the repeat loop.

It is always true that $i \ge 1$ and $m \le n$. Hence, $n$ is an
upper bound on the number of iterations through the for loop (either
one) each time it is entered. ■

Lemma 2: For all $i$, $1 \le i \le n$, $p_i \le d_i < p_i + w$, i.e.,
the RFP's generated by the algorithm are valid.

Proof: From the algorithm, if $p_i$ is set in step 5, then
$p_i \le d_i$ [since $p_i = \min(E)$] and $d_i - p_i < w$ (since
$\max(E) - \min(E) < w$). Hence, $p_i \le d_i < p_i + w$.

If $p_i$ is set in step 7, then $p_i \ge d_i - w + 1$ (since
$p_i = \max(E) - w + 1$) and $p_i + w - 1 - d_i < w$ (since
$\max(E) = p_i + w - 1$ and $\max(E) - \min(E) < w$). From the first
inequality, $d_i \le p_i + w - 1$ and from the second inequality
$p_i < d_i + 1$. Hence, $p_i \le d_i < p_i + w$. ■

The proof of the optimality of the generated RFPS requires some
additional notation. The subsequence $E$ which is defined during the
$k$th iteration of the repeat loop will be denoted $E_k$ (it
corresponds to the $k$th setting of the RFP). The corresponding integer
$m$ will be denoted $m_k$. For convenience in notation, we define
$m_0 = 0$. The number of iterations that the repeat loop executes
before terminating will be denoted by $K$ (it corresponds to the number
of times that the RFP is adjusted). Note that
$1 \le m_1 < m_2 < \cdots < m_K = n$.

For each location range $[m_{k-1} + 1, m_k]$, the RFP's in the RFPS
generated by the algorithm are constant. Within this location range,
$\phi_k$ and $\psi_k$ are the locations of the first occurrence of the
minimum and maximum nesting depths, respectively. More formally, see
the following.

Definition 9: $\phi_k$ and $\psi_k$ are the smallest integers such
that for each $k$ ($1 \le k \le K$), both are in the location range
$[m_{k-1} + 1, m_k]$, where $d_{\phi_k} = \min(E_k)$ and
$d_{\psi_k} = \max(E_k)$.

In order to prove the optimality of the RFPS generated by the
algorithm, it must be shown that this RFPS results in the lowest
possible memory traffic. This will be done by using induction on the
$K$ boundaries of $K - 1$ location ranges. These boundaries are defined
below and are denoted by $\theta_k$, for all $k$ such that
$1 \le k \le K$. The boundary point $\theta_k$ is the location of the
first minimum or maximum nesting depth within the location range
$[m_{k-1} + 1, m_k]$. If the RFP in the RFPS generated by the algorithm
for location range $[m_{k-1} + 1, m_k]$ is less than the RFP for
location range $[m_{k-2} + 1, m_{k-1}]$, then $\theta_k = \phi_k$,
otherwise $\theta_k = \psi_k$. More formally:

Definition 10: $\theta_k$ is an integer such that for each $k$,
$2 \le k \le K$, $\theta_k = \phi_k$ if $d_{\phi_k} < d_{\phi_{k-1}}$,
and $\theta_k = \psi_k$ otherwise. For convenience in notation, we
define $\theta_1 = 1$.

Fig. 1 shows the PNDS from the execution of Ackermann's function
with arguments (2, 1). The dotted squares show the "optimal" RFP's for
a register file that can hold three frames. In this example, five
RFM's are necessary ($\mathrm{RFM}_P[1, 29] = 5$, $K = 6$) and the
memory traffic resulting from those RFM's is 12 frames
($\mathrm{MT}_P[1, 29] = 12$).

Let $Q = (q_0, q_1, q_2, \ldots, q_n)$ be an arbitrary valid RFPS for
$D$.

The rest of this section contains a formal proof that the number of
RFM's in $P$ and the memory traffic resulting from those RFM's are at
most equal to the number of RFM's in $Q$ and the memory traffic
resulting from those RFM's, respectively.

Lemma 3: If $K > 1$, then for all $k$, $1 \le k < K$,
$\mathrm{RFM}_Q[1, m_k + 1] \ge k$.

Proof:  See  the  Appendix. 

From the algorithm, for all $k$, $1 \le k \le K$,
$d_{\psi_k} - d_{\phi_k} \le w - 1$. It is now shown that for
$1 \le k \le K - 1$, $d_{\psi_k} - d_{\phi_k} = w - 1$.

Lemma 4: If $K > 1$, then for all $k$, $1 \le k \le K - 1$,
$d_{\psi_k} - d_{\phi_k} = w - 1$.

Proof:  See  the  Appendix. 

It should be noted that Lemma 4 makes no claims about the value of
$(d_{\psi_K} - d_{\phi_K})$, i.e., it makes no claims about the case
$k = K$. From the algorithm it is clear that
$d_{\psi_K} - d_{\phi_K} < w$. So it is quite possible that
$d_{\psi_K} - d_{\phi_K} < w - 1$.

Lemma 5: If $K > 1$, for all $k$, $1 \le k \le K - 1$, for all $i$,
$m_{k-1} + 1 \le i \le m_k$, $p_i = d_{\phi_k} = d_{\psi_k} - w + 1$.
For the last subsequence, i.e., $k = K$: if $\theta_K = \phi_K$, then
$p_i = d_{\phi_K}$, else $p_i = d_{\psi_K} - w + 1$.

Proof:  See  the  Appendix. 




Lemma 6: If $K > 1$, then for all $k$, $1 \le k \le K$,
$\mathrm{MT}_Q[1, \theta_k] \ge \mathrm{MT}_P[1, \theta_k] +
|q_{\theta_k} - p_{\theta_k}|$.

Proof:  See  the  Appendix. 

Using the above lemmas, we can formally prove the "optimality" of
the RFPS generated by the algorithm.

Theorem 1: The RFPS $P$ generated by the algorithm for the PNDS $D$
is an optimal RFPS for $D$, i.e., if $Q$ is an arbitrary valid RFPS for
$D$, then

$$\mathrm{RFM}_P[1, n] \le \mathrm{RFM}_Q[1, n]$$

and

$$\mathrm{MT}_P[1, n] \le \mathrm{MT}_Q[1, n].$$


    int depthlist[] = {   /* This is the PNDS, 0 terminated */
        1, 2, 3, 2, 3, 4, 3, 2, 1, 0 };
    int depthind = 1;

    main()
    {
        while (depthlist[depthind] > 1) {
            deeper(2);
            depthind = depthind + 1;
        }
    }

    deeper(curdep)
    int curdep;   /* The current nesting depth */
    {
        depthind = depthind + 1;
        while (depthlist[depthind] > curdep) {
            deeper(curdep + 1);
            depthind = depthind + 1;
        }
        if (depthlist[depthind] == 0)
            exit(0);
    }


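Proof: If $K = 1$, then there are no RFM's in $P$ so
$\mathrm{RFM}_P[1, n] = 0$, $\mathrm{MT}_P[1, n] = 0$, and the theorem
holds.

Assume $K > 1$. In the algorithm, all the RFP's corresponding to the
same subsequence are set to the same value (step 5 or step 7). Hence,
the only way that $p_i \ne p_{i+1}$ can occur is if $i = m_k$ for some
$k$, $1 \le k \le K - 1$. Thus, $\mathrm{RFM}_P[1, n] \le K - 1$.

From Lemma 3, $K - 1 \le \mathrm{RFM}_Q[1, m_{K-1} + 1]$. Since
$m_{K-1} + 1 \le n$, $\mathrm{RFM}_Q[1, m_{K-1} + 1] \le
\mathrm{RFM}_Q[1, n]$. Hence, $K - 1 \le \mathrm{RFM}_Q[1, n]$. Thus,
$\mathrm{RFM}_P[1, n] \le \mathrm{RFM}_Q[1, n]$.

From Lemma 6, $\mathrm{MT}_Q[1, \theta_K] \ge \mathrm{MT}_P[1,
\theta_K] + |q_{\theta_K} - p_{\theta_K}|$. Since
$|q_{\theta_K} - p_{\theta_K}| \ge 0$, $\mathrm{MT}_Q[1, \theta_K] \ge
\mathrm{MT}_P[1, \theta_K]$. Since $\theta_K \le n$,
$\mathrm{MT}_Q[1, n] \ge \mathrm{MT}_Q[1, \theta_K]$. Since
$d_{\theta_K} \in E_K$ and $d_n \in E_K$, $p_{\theta_K} =
p_{\theta_K + 1} = \cdots = p_n$. Thus, $\mathrm{MT}_P[\theta_K + 1, n]
= 0$, so $\mathrm{MT}_P[1, \theta_K] = \mathrm{MT}_P[1, n]$. Hence,
$\mathrm{MT}_P[1, n] \le \mathrm{MT}_Q[1, n]$.

Q.E.D.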




D.  The  Unrealizability  of  an  Optimal  Strategy 

When  a  computer  is  executing  a  program,  the  entire  call/ 
return  trace  is  not  known  ahead  of  time.  In  fact,  it  is  unlikely 
that  there  is  any  look-ahead  possible.  In  this  section  it  is  shown 
that  knowledge  of  the  entire  PNDS  is  necessary  for  finding  an 
optimal  RFPS. 

First, it should be noted that no simplifying assumptions about the
properties of the call/return trace of "real" programs can be made. In
other words, for every given sequence of integers which satisfies the
definition of a PNDS (Definition 1), it is possible to construct a real
program whose sequence of nesting depths is the given sequence. This is
demonstrated by the program in Fig. 2 (which is written in the C
language [7]). When this program is executed, its sequence of nesting
depths is identical to the sequence of integers in the array depthlist
(assuming that the sequence of integers in depthlist is a valid PNDS).

Fig. 2. A program whose "behavior" follows an arbitrary PNDS.

To show that unbounded look-ahead on the call/return trace is
necessary for achieving an optimal RFPS, consider a system where there
is a bounded (or nonexistent) look-ahead; more specifically, a system
where at each point in time only the next $t$ calls and returns are
known in advance. (Note that in most systems $t = 0$.) Assume that the
register file of the system can hold $w$ frames and that there are two
programs to be executed, PROG1 and PROG2. These programs have identical
call/return traces for the first $s$ calls and returns, where
$w + t < s$. At some point, before $s - t$ calls/returns are executed,
the nesting depth (in both programs) reaches $w + 1$. The nesting depth
stays between 2 and $w + 1$ until a total of $s$ calls/returns are
executed. After that, in PROG1 the nesting depth decreases and the
program terminates at nesting depth 1. In PROG2, on the other hand, the
nesting depth increases to $w + 2$ and then decreases until the program
terminates at nesting depth 1.

In both programs, when the nesting depth first reaches $w + 1$, the
same information about the call/return trace is available, and
therefore any strategy for managing the register file will result in
the same action being taken for both programs. This action is clearly
not optimal for at least one of the programs. For PROG1, the optimal
action is to move one frame to memory. This action is not optimal for
PROG2 since another overflow will occur when a nesting depth of $w + 2$
is reached. The optimal action for PROG2 is to move two frames to
memory so that only one overflow will occur during the execution of the
program. Moving two frames to memory is not the optimal action for
PROG1 since it results in unnecessary memory traffic: moving two frames
to and from memory instead of one.

The  fact  that  an  optimal  strategy  is  not  realizable  does  not 
imply  that  all  practical  strategies  for  managing  the  register 
file  are  equally  bad.  As  seen  in  the  next  section,  simple  changes 
in  the  strategy  for  managing  the  register  file  may  significantly 
affect  the  cost  of  handling  calls  and  returns. 

III.  Practical  Strategies  for  Managing  the 
Register  File 

In most real systems, no look-ahead at the call/return trace is
possible. Thus, the decision as to how many frames should be moved
to/from memory when an overflow/underflow occurs must be based on the
previous behavior of the executing program or be completely independent
of the PNDS of the executing program.

As indicated above, two factors contribute to the cost (execution
time) of handling register file overflows and underflows: the handling
of the interrupt/trap that is initiated by the overflow/underflow and
the actual transfer of the STACK1 frames to/from memory. If the number
of frames which are moved when an interrupt occurs is not fixed, some
computation may be required in order to calculate this number. The cost
of this calculation is included in the cost of handling the interrupt.
In order to evaluate the effectiveness of different strategies for
managing the register file, these strategies can be tried out on the
call/return trace of benchmark programs. The number of
overflows/underflows and transfers of STACK1 frames which result from
each strategy can thus be determined. These numbers can then be related
to the cost of the overflow/underflow handler using the following
formula:

$$\text{cost} = \alpha \times (\text{number overflows} + \text{number underflows}) + \beta \times (\text{number frames moved})$$

where $\alpha$ and $\beta$ are constants: $\alpha$ is the cost of
responding to the interrupt and calculating the number of frames to be
moved, and $\beta$ is the cost of moving one STACK1 frame to or from
memory.
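
The cost formula can be written as a one-line C function. The sketch below is illustrative only; the structure `rf_stats` and the function `rf_cost` are hypothetical names for the three counts and the two constants that appear in the formula.

```c
#include <assert.h>

/* Counts obtained by trying one strategy on one call/return trace. */
struct rf_stats {
    long overflows;
    long underflows;
    long frames_moved;
};

/* cost = alpha * (overflows + underflows) + beta * frames_moved,
 * where alpha is the per-trap cost and beta the per-frame cost. */
long rf_cost(const struct rf_stats *s, long alpha, long beta)
{
    assert(alpha >= 0 && beta >= 0);
    return alpha * (s->overflows + s->underflows)
         + beta * s->frames_moved;
}
```

Because both terms are linear, two strategies can be compared on the same trace by comparing their counts alone once the two constants are fixed.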

A. Measurement Technique

The method used for obtaining the call/return trace of the benchmark
programs used in this paper relies on the fact that the call/return
trace of a program executing on a RISC computer is identical to the
call/return trace of the same program executing on any similar
computer. In this case, the benchmark programs are all written in the C
language [7], and their call/return trace is obtained from their
execution on a VAX 11/780. The assembly code produced by the C compiler
is processed by an editor script which inserts calls to special
procedures before and after each procedure call instruction. When the
program is executed, in addition to producing its normal output, it
creates a file containing a string of bits. The $i$th bit in the string
corresponds to the $i$th call/return executed by the program. This bit
is 1 if a call was executed, 0 if a return was executed. The bit string
is the call/return trace of the program. Routines which simulate
different strategies for managing the register file use this string to
obtain the number of overflows/underflows and the resulting memory
traffic which will occur if the benchmark program is executed using the
simulated strategy.
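
A simulation routine first has to turn the bit string into nesting depths. The C sketch below shows one way to do this; it is not the authors' routine. It assumes, for readability, that the trace is held as one character ('1' or '0') per call/return rather than as packed bits, and that execution starts at nesting depth 1. The name `trace_to_depths` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Converts a call/return trace into a sequence of nesting depths.
 * trace[k] is '1' for a call and '0' for a return.
 * depth[0] is set to the assumed initial depth 1, and depth[k + 1] is
 * the depth after the k-th trace entry, so depth needs len + 1 slots.
 * Returns the maximum nesting depth reached. */
int trace_to_depths(const char *trace, size_t len, int *depth)
{
    int cur = 1, max = 1;
    depth[0] = cur;
    for (size_t k = 0; k < len; k++) {
        assert(trace[k] == '0' || trace[k] == '1');
        cur += (trace[k] == '1') ? 1 : -1;
        assert(cur >= 1);        /* a return below depth 1 is malformed */
        depth[k + 1] = cur;
        if (cur > max)
            max = cur;
    }
    return max;
}
```

For the trace 11010100 the depths are 1, 2, 3, 2, 3, 2, 3, 2, 1 and the maximum depth is 3.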

For this study, three benchmark programs were used:

rcc     The RISC C compiler [2], which is based on Johnson's portable
        C compiler [6]. The call/return trace used was generated by
        the compiler compiling the UNIX file concatenation utility
        cat. 88 606 calls and returns were executed and a nesting
        depth of 26 was reached.

puzzle  This is a bin-packing program which solves a
        three-dimensional puzzle. It was developed by Forest Baskett.
        During the execution of the program, 42 710 calls and returns
        were executed and a nesting depth of 20 was reached.

tower   This is a Tower of Hanoi program. The call/return trace used
        was obtained for the program moving 18 disks. 1 048 574 calls
        and returns were executed and a nesting depth of 20 was
        reached.

In this paper, the cost of handling register file overflows and
underflows is assumed to be directly proportional to the number of RISC
instructions they require. If no calculation is needed in order to
determine the number of frames to be moved, the cost of responding to
the interrupt is approximately 30 instructions ($\alpha = 30$ in the
above discussion). The cost of moving one STACK1 frame is 16
instructions ($\beta = 16$ in the above discussion).

B. The Cost of "Fixed" Strategies

The simplest strategy for managing the register file is to always
move the same number of frames (say $i$) to memory when an overflow
occurs, and always move the same number of frames (say $j$) from memory
when an underflow occurs. For a register file that can hold $w$ frames,
such a strategy will be denoted fixed($i$, $j$), where $i$ and $j$ are
integers such that $1 \le i \le w$ and $1 \le j \le w$.
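
A fixed($i$, $j$) strategy is simple to try out on a sequence of nesting depths. The following C sketch is an illustration under stated assumptions rather than the simulator used for the measurements: the depths start at 1 and change by exactly one per step, the RFP starts at 1, and an underflow never restores more frames than are currently in memory (the RFP is not allowed to drop below 1). The names `fixed_counts` and `simulate_fixed` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct fixed_counts {
    long overflows;
    long underflows;
    long frames_moved;
};

/* Simulates fixed(i, j) on the nesting depths d[0..n-1] for a register
 * file holding w frames. Frames p .. p+w-1 are resident, p is the RFP. */
struct fixed_counts simulate_fixed(const int *d, size_t n,
                                   int w, int i, int j)
{
    assert(w >= 1 && 1 <= i && i <= w && 1 <= j && j <= w);
    struct fixed_counts c = {0, 0, 0};
    int p = 1;                               /* assumed initial RFP */
    for (size_t k = 0; k < n; k++) {
        if (d[k] >= p + w) {                 /* overflow: spill i frames */
            p += i;
            c.overflows++;
            c.frames_moved += i;
        } else if (d[k] < p) {               /* underflow: restore frames */
            /* Assumption: at most p - 1 frames are in memory. */
            int moved = (j < p - 1) ? j : p - 1;
            p -= moved;
            c.underflows++;
            c.frames_moved += moved;
        }
        assert(p <= d[k] && d[k] < p + w);   /* holds for unit steps */
    }
    return c;
}
```

On the depths (1, 2, 3, 4, 5, 4, 3, 2, 1) with w = 3, fixed(1, 1) takes two overflows and two underflows and moves 4 frames, while fixed(3, 3) takes one of each but moves 6 frames.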

When a fixed strategy is used, no computation is required in order
to determine the number of frames to be moved. Hence, the equation

$$\text{cost} = 30 \times (\text{number overflows} + \text{number underflows}) + 16 \times (\text{number frames moved})$$

is used to evaluate the cost of managing the register file. This
equation is also used in evaluating the cost of the optimal strategy,
which serves as a lower bound on the cost of other strategies.

1) Measurement Results: The actual "performance" of the optimal
strategy and fixed strategies is presented in this section. All
possible fixed strategies for register files containing 3, 5, 7, 9, 13,
and 17 register banks have been tried with the three benchmark
programs.

Tables I-III summarize the results for each one of the three
benchmark programs with six different register file sizes and for seven
different strategies. The results include the number of overflows,
number of underflows, memory traffic, and cost. For the optimal
strategy, the "raw" numbers are presented. For the other six
strategies, the figures shown are normalized with respect to the
corresponding entries for the optimal strategy with the same register
file size. In the three tables $w$ denotes the number of register banks
in the register file.

The fixed strategies included in the tables are: the best of all
fixed strategies (i.e., the strategy resulting in the least cost) for
the particular program and register file size; the worst of all fixed
strategies (i.e., the strategy resulting in the greatest cost) for the
particular program and register file size; fixed($w$, 1), which
guarantees the minimum number of overflows; fixed(1, $w$), which
guarantees the minimum number of underflows; fixed(1, 1), which
guarantees the minimum memory traffic; and
fixed($\lceil w/2 \rceil$, $\lceil w/2 \rceil$), which is
"symmetrical."

2) Discussion of Measurement Results: Although the three benchmark
programs used are quite different, the results show many common
characteristics in their behavior, as far as the management of the
register file is concerned. In addition, the results for the
fixed($w$, 1), fixed(1, $w$), and fixed(1, 1) strategies provide an
experimental verification of the fact that the "optimal strategy"
presented in Section II does indeed minimize the number of
overflows/underflows and memory traffic simultaneously.

The register file size and the way that the register file is managed
can significantly affect the cost of procedure calls. Table IV shows
the average number of instructions per procedure call required for
managing the register file. For every call there is a corresponding
return. Hence, in this context, "procedure call" includes returning
from the procedure as well as invoking it.

The  data  indicate  that,  even  with  the  optimal  strategy,  the
cost  of  managing  the  register  file  may  become  prohibitive  if
the  register  file  is  too  small  (three  register  banks).  In  this  case, 
for  two  out  of  the  three  programs  (rcc  and  tower),  it  is  likely 
that  a  conventional  stack  mechanism  for  handling  procedure 
calls  would  have  resulted  in  better  performance.  If  a  larger 
register  file  is  used,  the  cost  of  managing  the  register  file  drops 
sharply.  The  results  indicate  that,  for  a  register  file  of  five  or 
more  register  banks,  this  scheme  compares  favorably  with  the 
conventional  stack  mechanism. 

Invoking a high-level language procedure and returning from it
requires several RISC instructions in addition to those used for
managing the register file. Specifically, arguments have to be copied
to the area of overlap between the current STACK1 frame and the next
STACK1 frame; if the procedure returns a value, it may have to be
copied from this overlap area; the stack pointer and frame pointer for
STACK2 may need to be updated; the actual RISC call and ret
instructions must be executed. C procedures typically have less than
four arguments [5]. Hence, in addition to the RISC instructions that
manage the register file, between three and seven instructions will be
executed for each procedure call/return pair.

If an efficient strategy (such as the "best fixed strategy") is
used, the cost of managing the register file decreases as the number of
register banks in the register file increases. Once this cost reaches
approximately one RISC instruction per procedure call/return pair
(e.g., using the "best fixed strategy" with a register file containing
nine register banks), it no longer dominates the total number of
instructions required for each procedure call/return. In a single chip
VLSI microprocessor, chip area is a precious resource. Rather than
adding more register banks (e.g., beyond nine), the limited chip area
can be used more effectively for other purposes, such as an on-chip
cache or hardware support for multiply, that are likely to make a
greater contribution to overall processor performance. Even for the
benchmarks used here, which reach a relatively high nesting depth [5],
a register file with between five and nine register banks seems
optimal.

Choosing a "good" strategy is critical to the success of the
register file scheme. Tables II and III show that choosing the "wrong"
strategy can result in more than four orders of magnitude increase in
the cost of managing the register file. Furthermore, if an inefficient
strategy is used, an increase in the register file size can result in
an increase in the cost of managing the register file (since there is
an opportunity to generate more useless memory traffic). In most cases,
the best fixed strategy is to minimize the memory traffic (i.e., use
the fixed(1, 1) strategy). This can be explained by the fact that the
cost of moving one frame to memory and then from memory back to the
register file is about the same as the cost of handling the trap when
an overflow or underflow occurs. Hence, the immediate cost of
unnecessarily moving a frame (which results in one frame's worth of
traffic to memory and later back to the register file) is about equal
to the cost of not moving a frame when it should have been moved (an
extra overflow or underflow trap). In addition, if an unnecessary move
is made, the cost may include the cost of an extra overflow or
underflow which will occur later. Hence, the "penalty" for moving one
more frame than necessary, when an overflow or underflow occurs, is
greater than the "penalty" for moving one fewer frame than necessary.
Thus, if the call/return sequence is random, the best fixed strategies
are likely to be those that require the movement of only one or two
frames when an overflow or underflow occurs. The use of such strategies
is further supported by the fact that with the optimal strategy, in
cases where there are more than ten overflows/underflows throughout the
execution of the program, the average number of frames moved when an
overflow or underflow occurs is between 1.4 and 3 and in most cases is
approximately 2.

C.  Taking  the  Past  into  Account 

The  fixed  strategies  do  not  attempt  to  take  into  account  the 
previous  behavior  of  the  executing  program.  It  is  conceivable 
that  a  strategy  that  does  take  past  behavior  into  account  would 
result  in  a  lower  cost,  closer  to  that  of  the  optimal  strategy. 

One way of "taking the past into account" involves keeping track of
which register banks have been used since the last overflow or
underflow. If two or more STACK1 frames are moved whenever an overflow
or underflow occurs, it is clear that, in some cases, it will turn out
that too many frames will be moved, resulting in unnecessary memory
traffic. When an overflow occurs, register banks are "freed" by copying
their contents to memory. If some of the freed register banks remain
unused until the next underflow, their contents remain intact and need
not be copied from memory to the register file. Similarly, if too many
register banks are loaded when an underflow occurs, the contents of
those that are unused until the next overflow need not be copied to
memory since their contents are already in the appropriate memory
locations.

Many practical strategies result in unnecessary memory traffic, i.e.,
more memory traffic than is required by the optimal strategy. The above
technique reduces the memory traffic resulting from any such strategy.
Our measurements indicate that with the useless "worst fixed strategy,"
which produces an exorbitant number of unnecessary moves of STACK1
frames, keeping track of which register banks are used can reduce this
memory traffic by up to an order of magnitude. However, with
"reasonable" strategies, the gains are less impressive. If the "best
fixed strategy" is fixed(1, 1) then clearly no gain is possible. With
the fixed(2, 2) strategy, the decrease in memory traffic is less than
ten percent. The above technique requires some extra hardware and a few
more instructions in the trap handling routine. When the overhead of
these extra instructions is taken into account, the total cost of
managing the register file for the fixed(2, 2) strategy is about the
same as without this extra mechanism. For the fixed(1, 1) strategy, the
extra instructions will simply add to the cost of managing the register
file without any saving in memory traffic.

We  have  investigated  two  other  methods  for  “taking  the  past 
into  account.”  They  both  involve  determining  the  number  of 
frames  to  be  moved  when  an  overflow  or  underflow  occurs 
based  on  the  previous  behavior  of  the  program.  The  first 
method (henceforth denoted C/R) is to use the call/return
trace  immediately  preceding  the  overflow  or  underflow.  The 
second  method  (henceforth  denoted  O/U)  is  to  use  the  trace 
of  overflows  and  underflows  which  preceded  the  trap  being 
handled. 

The  C/R  method  can  be  implemented  by  adding  a  special 
shift  register  to  the  processor.  Every  call  instruction  shifts  a 
1  into  the  register  and  every  return  shifts  a  0.  The  routine 
which  handles  the  overflow/underflow  trap  examines  the 
contents of this register and determines the immediately
preceding call/return trace of the program. This pattern is used
to access a table containing the "optimal" number of frames
that  should  be  moved,  given  a  particular  call/return  pattern. 
This  scheme  adds  very  few  instructions  to  the  cost  of  handling 
the  overflow/underflow  trap. 
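
The C/R mechanism described above can be sketched in software as follows. This is a minimal illustration, not the hardware design: the class name `CallReturnHistory`, the helper `frames_to_move`, the four-bit register length, and the table contents are all assumptions introduced here (the text only states that patterns of length ten or less were examined).

```python
class CallReturnHistory:
    """Software model of the special shift register (names are hypothetical).

    Every call shifts a 1 into the register and every return shifts a 0.
    """

    def __init__(self, bits=4):  # register length is an assumption
        self.bits = bits
        self.mask = (1 << bits) - 1
        self.value = 0

    def record_call(self):
        self.value = ((self.value << 1) | 1) & self.mask

    def record_return(self):
        self.value = (self.value << 1) & self.mask


def frames_to_move(history, table, default=1):
    """Trap handler step: use the recent call/return pattern as a table index.

    Patterns missing from the table fall back to moving a single frame.
    """
    return table.get(history.value, default)


# Hypothetical table: move two frames only after four consecutive calls.
example_table = {0b1111: 2}
```

The trap handler only performs one table lookup, which matches the remark that the scheme adds very few instructions to the handler.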

The  O/U  method  does  not  require  any  additional  hardware. 
The "overflow/underflow trace" is kept in a fixed memory
location and is updated each time an overflow or underflow
occurs by the routine that handles these traps. The pattern in
this  memory  location  is  used  in  the  same  way  as  the  contents 
of  the  shift  register  for  the  C/R  method. 

Both the C/R method and O/U method require finding a
mapping between "call/return patterns" or "overflow/underflow
patterns" and "number of frames to be moved" so that
the total cost is reduced. In order to find such a mapping (for
either one of the methods) we tabulated the optimal number
of frames to be moved (which can be found given unbounded
look-ahead) following various call/return or overflow/underflow
patterns for the three benchmark programs. We
attempted to use these tables to determine which patterns
indicate that a single frame should be moved and in which cases
moving more than one frame would be preferable. However,
we could not find a single mapping which worked better than
the fixed(1, 1) strategy for all three programs!

For the three benchmark programs used in this work, it
appears that the optimal number of frames to be moved is, for
all practical purposes, independent of the immediately
preceding call/return pattern of length ten or less. The O/U
method shows more promise but the results are inconclusive.
Following a suggestion by Denning [3], we tested an O/U
method which involved moving two frames after two
consecutive overflows or underflows and moving one frame otherwise.
For register file sizes of interest (between five and nine frames),
the cost of managing the register file using this method was
compared to the cost using the fixed(1, 1) strategy. Reductions
of up to 28 percent in the number of overflows and underflows
and increases of up to 59 percent in the memory traffic were
measured. When the extra instructions in the trap handling
routines are taken into account, the overall cost was either
equal to or greater than the cost of the fixed(1, 1) strategy in
all but one case.
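
The trade-off between trap count and memory traffic can be explored with a small simulation. The sketch below is a simplified model written for illustration and is not the simulator used for the measurements above: it counts whole frames only, ignores the cost of the extra handler instructions, and the names `simulate`, `fixed_1_1`, and `make_two_after_repeat` are hypothetical. The trace is a string of `C` (call) and `R` (return) events.

```python
def simulate(trace, capacity, policy):
    """Count traps and frames moved for a call/return trace.

    capacity: number of frames the register file can hold.
    policy(kind): frames to move for a trap, kind is "O" or "U".
    Returns (number_of_traps, frames_moved).
    """
    depth, in_regs = 1, 1          # one active frame, resident in registers
    traps = traffic = 0
    for event in trace:
        if event == "C":
            if in_regs == capacity:            # overflow: spill oldest frames
                n = min(policy("O"), in_regs)
                in_regs -= n
                traffic += n
                traps += 1
            in_regs += 1
            depth += 1
        elif event == "R":
            if depth == 1:
                raise ValueError("return from the outermost procedure")
            depth -= 1
            in_regs -= 1
            if in_regs == 0:                   # underflow: reload frames
                n = min(policy("U"), depth, capacity)
                in_regs = n
                traffic += n
                traps += 1
        else:
            raise ValueError("unknown event: %r" % event)
    return traps, traffic


def fixed_1_1(kind):
    """fixed(1, 1): always move one frame."""
    return 1


def make_two_after_repeat():
    """O/U variant: move two frames when a trap repeats the previous trap kind."""
    last = [None]

    def policy(kind):
        n = 2 if last[0] == kind else 1
        last[0] = kind
        return n

    return policy
```

For example, `simulate("C" * 10 + "R" * 10, 4, fixed_1_1)` can be compared with the same call using `make_two_after_repeat()`. A monotone trace like this one favors the O/U variant; traces that oscillate around the register file boundary are where the extra memory traffic reported above would be expected to appear.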

IV.  Conclusions 

The  success  of  the  RISC  architecture  is  due,  in  part,  to  the 
reduction  in  the  number  of  memory  accesses  which  is  possible 
through the use of the register file [11]. We have shown that
the effectiveness of the register file is dependent on choosing
the "right" size for the register file and an efficient strategy
for  deciding  how  many  frames  should  be  moved  to/from 
memory  when  an  overflow/underflow  occurs. 

Our measurements indicate that with the simple fixed
strategy, fixed(1, 1), the cost of managing the register file is
within a factor of two of the cost of the optimal strategy (which
requires unbounded look-ahead). For a register file containing
more than eight register banks, the fixed(2, 2) strategy yields
slightly better performance.

If a "reasonable" strategy is used, the cost of managing the
register file is inversely proportional to its size. If the register
file is too small, the number of overflows and underflows
becomes prohibitively large. Since the STACK1 frames have a
fixed size, the large number of overflows and underflows results
in a lot of memory traffic even when the number of registers
actually used (for arguments and local variables) is small.
Hence, if the register file is too small, the overall cost of
procedure calls may be greater than if a conventional stack
mechanism is used. Our measurements indicate that if the
register file contains five or more frames, the use of the register
file scheme rather than a conventional stack mechanism is
worthwhile.

We  have  attempted  to  use  past  behavior  of  the  program  in 
order  to  predict  the  future  behavior  and  reduce  the  cost  of 
managing the register file. So far, our attempts have not
succeeded.

The first method (keeping track of which register banks have
been used since the last overflow or underflow) reduces the
cost of managing the register file only for inefficient strategies.
For efficient strategies, such as fixed(1, 1) or fixed(2, 2), the
extra overhead in the trap handling routine was greater than
the savings from the reduced memory traffic.

The two other methods attempt to determine the "optimal"
number of frames to be moved from the immediately preceding
pattern of calls/returns or overflows/underflows. These
methods appear ineffective since we could not find a single
mapping between either type of pattern and the number of frames
to be moved which reduces the cost for all three programs.
These results, while preliminary, raise serious doubts that a
mapping which reduces the cost of managing the register file
for a majority of programs could be found. Even in this context,
the simplest solution appears to also be the best.

Appendix 

Proofs of Lemmas 3-6

Lemma 3: If $K > 1$, then for all $k$, $1 \le k < K$,

$$\mathrm{RFM}_Q[1, m_k + 1] \ge k.$$

Proof: By induction on $k$.

Basis: $k = 1$. It is shown that $\mathrm{RFM}_Q[1, m_1 + 1] \ge 1$.
From the algorithm,

$$\max(E_1) - \min(E_1) < w$$

while

$$\max(E_1 \cup \{d_{m_1+1}\}) - \min(E_1 \cup \{d_{m_1+1}\}) \ge w.$$

Hence, either

$$d_{m_1+1} < \min(E_1)$$

or

$$d_{m_1+1} > \max(E_1).$$

By Definition 1, $d_1 = 1$ and $d_i \ge 1$ for all $i$, $1 \le i \le n$.
Hence,

$$d_1 = \min(E_1 \cup \{d_{m_1+1}\}) = \min(E_1)$$

and

$$d_{m_1+1} = \max(E_1 \cup \{d_{m_1+1}\}).$$

Thus,

$$d_{m_1+1} - d_1 \ge w.$$

Since $Q$ is a valid RFPS for $D$, $q_1 \le d_1 < q_1 + w$ and
$q_{m_1+1} \le d_{m_1+1} < q_{m_1+1} + w$. $d_{m_1+1} \ge d_1 + w$ and
$d_1 \ge q_1$ imply that $d_{m_1+1} \ge q_1 + w$. But
$q_{m_1+1} + w > d_{m_1+1}$. Hence,

$$q_{m_1+1} + w > q_1 + w,$$

i.e., $q_{m_1+1} > q_1$. The fact that $q_{m_1+1} \ne q_1$ implies that at least
one RFM occurs in the location range $[2, m_1 + 1]$, so
$\mathrm{RFM}_Q[1, m_1 + 1] \ge 1$.

Induction Step: Assuming that this lemma holds for
$k = \alpha - 1$, where $2 \le \alpha < K$, it is now proven that it holds for
$k = \alpha$. In other words, assuming
$\mathrm{RFM}_Q[1, m_{\alpha-1} + 1] \ge \alpha - 1$, it is proven that
$\mathrm{RFM}_Q[1, m_\alpha + 1] \ge \alpha$.

If $\mathrm{RFM}_Q[1, m_{\alpha-1} + 1] \ge \alpha - 1$, then either

$$\mathrm{RFM}_Q[1, m_{\alpha-1} + 1] \ge \alpha$$

or

$$\mathrm{RFM}_Q[1, m_{\alpha-1} + 1] = \alpha - 1.$$

The former case implies that $\mathrm{RFM}_Q[1, m_\alpha + 1] \ge \alpha$ (since
$m_\alpha + 1 > m_{\alpha-1} + 1$) and the lemma is proved. Hence, we can
assume $\mathrm{RFM}_Q[1, m_{\alpha-1} + 1] = \alpha - 1$.

The rest of the proof is similar to the proof of the basis:
From the algorithm,

$$\max(E_\alpha) - \min(E_\alpha) < w$$

while

$$\max(E_\alpha \cup \{d_{m_\alpha+1}\}) - \min(E_\alpha \cup \{d_{m_\alpha+1}\}) \ge w.$$

Hence, either $d_{m_\alpha+1} < \min(E_\alpha)$ or $d_{m_\alpha+1} > \max(E_\alpha)$.

Assume $d_{m_\alpha+1} < \min(E_\alpha)$:
From the algorithm and the definition of $\Psi$,
$d_{\Psi_\alpha} - d_{m_\alpha+1} \ge w$. Since $Q$ is a valid RFPS for $D$,

$$q_{\Psi_\alpha} \le d_{\Psi_\alpha} < q_{\Psi_\alpha} + w$$

and

$$q_{m_\alpha+1} \le d_{m_\alpha+1} < q_{m_\alpha+1} + w.$$

Hence,

$$q_{\Psi_\alpha} + w > d_{\Psi_\alpha} \ge d_{m_\alpha+1} + w \ge q_{m_\alpha+1} + w,$$

i.e., $q_{\Psi_\alpha} > q_{m_\alpha+1}$.

Assume $d_{m_\alpha+1} > \max(E_\alpha)$:
From the algorithm and the definition of $\Phi$,

$$d_{m_\alpha+1} - d_{\Phi_\alpha} \ge w.$$

Since $Q$ is a valid RFPS for $D$,

$$q_{\Phi_\alpha} \le d_{\Phi_\alpha} < q_{\Phi_\alpha} + w$$

and

$$q_{m_\alpha+1} \le d_{m_\alpha+1} < q_{m_\alpha+1} + w.$$

Hence,

$$q_{m_\alpha+1} + w > d_{m_\alpha+1} \ge d_{\Phi_\alpha} + w \ge q_{\Phi_\alpha} + w,$$

i.e., $q_{m_\alpha+1} > q_{\Phi_\alpha}$.

The fact that $q_{\Psi_\alpha} \ne q_{m_\alpha+1}$ ($q_{\Phi_\alpha} \ne q_{m_\alpha+1}$) implies that
there is at least one RFM in the location range
$[\Psi_\alpha + 1, m_\alpha + 1]$ ($[\Phi_\alpha + 1, m_\alpha + 1]$). But
$\Psi_\alpha \ge m_{\alpha-1} + 1$ ($\Phi_\alpha \ge m_{\alpha-1} + 1$). Hence, there is at
least one RFM in the location range $[m_{\alpha-1} + 2, m_\alpha + 1]$, i.e.,
$\mathrm{RFM}_Q[m_{\alpha-1} + 2, m_\alpha + 1] \ge 1$. But by assumption
$\mathrm{RFM}_Q[1, m_{\alpha-1} + 1] = \alpha - 1$. Hence,
$\mathrm{RFM}_Q[1, m_\alpha + 1] \ge \alpha$.

■

Lemma 4: If $K > 1$, then for all $k$, $1 \le k \le K - 1$,

$$d_{\Psi_k} - d_{\Phi_k} = w - 1.$$

Proof: From the algorithm,

$$\max(E_k) - \min(E_k) < w$$

while

$$\max(E_k \cup \{d_{m_k+1}\}) - \min(E_k \cup \{d_{m_k+1}\}) \ge w.$$

Hence, either $d_{m_k+1} < \min(E_k)$ or $d_{m_k+1} > \max(E_k)$.
By Definition 1, $|d_{m_k+1} - d_{m_k}| = 1$. Since $d_{m_k} \in E_k$, either
$d_{m_k+1} = \min(E_k) - 1$ or $d_{m_k+1} = \max(E_k) + 1$. Hence,

$$\max(E_k \cup \{d_{m_k+1}\}) - \min(E_k \cup \{d_{m_k+1}\}) = \max(E_k) - \min(E_k) + 1.$$

Thus, $\max(E_k) - \min(E_k) \ge w - 1$. But from the algorithm,
$\max(E_k) - \min(E_k) \le w - 1$. Hence,

$$\max(E_k) - \min(E_k) = w - 1,$$

i.e., $d_{\Psi_k} - d_{\Phi_k} = w - 1$.

■

Lemma 5: If $K > 1$, for all $k$, $1 \le k \le K - 1$, for all $i$,
$m_{k-1} + 1 \le i \le m_k$,

$$p_i = d_{\Phi_k} = d_{\Psi_k} - w + 1.$$

For the last subsequence, i.e., $k = K$: if $\theta_k = \Phi_k$, then
$p_i = d_{\Phi_k}$, else $p_i = d_{\Psi_k} - w + 1$.

Proof: For all $i$, $1 \le i \le n$, the value of $p_i$ is set in step 5
or in step 7 of the algorithm.

If $1 \le k \le K - 1$, then by Lemma 4,

$$d_{\Psi_k} - d_{\Phi_k} = w - 1.$$

Hence,

$$d_{\Phi_k} = d_{\Psi_k} - w + 1$$

and the same value ($d_{\Phi_k}$) will be assigned to $p_i$ in step 5 or step
7 of the algorithm.

If $k = K$, then it may be the case that $d_{\Psi_k} - d_{\Phi_k} < w - 1$.
Hence, it may make a difference whether the value of $p_i$ is
assigned in step 5 or in step 7. This is controlled by the value
of $\theta_k$.

If $\theta_k = \Phi_k$, then, by the definition of $\theta$,

$$d_{\Phi_k} < d_{\Phi_{k-1}}.$$

Since $k - 1 < K$,

$$p_{m_{k-1}} = p_{\Phi_{k-1}} = d_{\Phi_{k-1}}.$$

Hence,

$$\min(E_k) < p_{m_{k-1}}$$

and the second clause in step 4 of the algorithm is satisfied.
Thus, $p_i$ ($m_{k-1} + 1 \le i \le m_k$) is assigned a value in step 5 of
the algorithm. So

$$p_i = d_{\Phi_k}.$$

If $\theta_k = \Psi_k$, then, by the definition of $\theta$,

$$d_{\Phi_k} \ge d_{\Phi_{k-1}}.$$

Since $k - 1 < K$,

$$p_{m_{k-1}} = p_{\Phi_{k-1}} = d_{\Phi_{k-1}}.$$

Hence,

$$\min(E_k) \ge p_{m_{k-1}},$$

and the second clause in step 4 of the algorithm is not satisfied.
Since $k = K$, $m_k = n$, and the first clause in step 4 of the
algorithm is also not satisfied. Thus, $p_i$ ($m_{k-1} + 1 \le i \le m_k$)
is assigned a value in step 7 of the algorithm. So

$$p_i = d_{\Psi_k} - w + 1.$$

■

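Lemma 6: If $K > 1$, then for all $k$, $1 \le k \le K$,

$$\mathrm{MT}_Q[1, \theta_k] \ge \mathrm{MT}_P[1, \theta_k] + |q_{\theta_k} - p_{\theta_k}|.$$

Proof: By induction on $k$.

Basis: $k = 1$. It is shown that
$\mathrm{MT}_Q[1, \theta_1] \ge \mathrm{MT}_P[1, \theta_1] + |q_{\theta_1} - p_{\theta_1}|$.

By the definition of $\theta$, $\theta_1 = 1$. Hence,

$$\mathrm{MT}_Q[1, \theta_1] = \mathrm{MT}_Q[1, 1] = |q_1 - q_0|$$

and

$$\mathrm{MT}_P[1, \theta_1] + |q_{\theta_1} - p_{\theta_1}| = \mathrm{MT}_P[1, 1] + |q_1 - p_1| = |p_1 - p_0| + |q_1 - p_1|.$$

By Definitions 2 and 5, for all $i$, $1 \le i \le n$,

$$1 \le q_i \le d_i < q_i + w$$

and

$$1 \le p_i \le d_i < p_i + w.$$

By Definition 1, $d_1 = 1$. Hence, $q_1 = p_1 = 1$. From the
algorithm, $p_0 = 1$. Hence,

$$\mathrm{MT}_P[1, \theta_1] + |q_{\theta_1} - p_{\theta_1}| = |p_1 - p_0| + |q_1 - p_1| = 0.$$

Since $|q_1 - q_0| \ge 0$,

$$\mathrm{MT}_Q[1, \theta_1] \ge 0.$$

Thus,

$$\mathrm{MT}_Q[1, \theta_1] \ge \mathrm{MT}_P[1, \theta_1] + |q_{\theta_1} - p_{\theta_1}|.$$

Induction Step: Assuming that this lemma holds for
$k = \alpha - 1$, where $2 \le \alpha \le K$, it is now proven that it holds for
$k = \alpha$. In other words, assuming

$$\mathrm{MT}_Q[1, \theta_{\alpha-1}] \ge \mathrm{MT}_P[1, \theta_{\alpha-1}] + |q_{\theta_{\alpha-1}} - p_{\theta_{\alpha-1}}|$$

it is proven that

$$\mathrm{MT}_Q[1, \theta_\alpha] \ge \mathrm{MT}_P[1, \theta_\alpha] + |q_{\theta_\alpha} - p_{\theta_\alpha}|.$$

From Definition 8,

$$\mathrm{MT}_Q[1, \theta_\alpha] = \mathrm{MT}_Q[1, \theta_{\alpha-1}] + \mathrm{MT}_Q[\theta_{\alpha-1} + 1, \theta_\alpha]$$

and

$$\mathrm{MT}_P[1, \theta_\alpha] = \mathrm{MT}_P[1, \theta_{\alpha-1}] + \mathrm{MT}_P[\theta_{\alpha-1} + 1, \theta_\alpha].$$

Using the induction hypothesis,

$$\mathrm{MT}_Q[1, \theta_\alpha] \ge \mathrm{MT}_P[1, \theta_{\alpha-1}] + |q_{\theta_{\alpha-1}} - p_{\theta_{\alpha-1}}| + \mathrm{MT}_Q[\theta_{\alpha-1} + 1, \theta_\alpha].$$

$\mathrm{MT}_Q[\theta_{\alpha-1} + 1, \theta_\alpha]$ is the number of STACK1 frames
transferred to/from memory in location range $[\theta_{\alpha-1} + 1, \theta_\alpha]$.
A change by one in the RFP indicates that one STACK1 frame
is transferred to or from memory. Hence, the memory traffic
in location range $[\theta_{\alpha-1} + 1, \theta_\alpha]$ is at least the difference
between the RFP at the beginning of the range and the RFP at
the end of the range, i.e.,

$$\mathrm{MT}_Q[\theta_{\alpha-1} + 1, \theta_\alpha] \ge |q_{\theta_\alpha} - q_{\theta_{\alpha-1}}|.$$

Hence,

$$\mathrm{MT}_Q[1, \theta_\alpha] \ge \mathrm{MT}_P[1, \theta_{\alpha-1}] + |q_{\theta_{\alpha-1}} - p_{\theta_{\alpha-1}}| + |q_{\theta_\alpha} - q_{\theta_{\alpha-1}}|.$$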

From  Definition  8  and  the  algorithm, 

MT,|9„_,+  1.9..]-  E  |/>a-/’a-.| 

1 

=  £1  \Pd~Pii- 1|  km.,-i+l  1 1 

J-i 1 

<>„ 

■F  E  \p.l  ~  Pit-  1  I  =  km, .-l+l  ~ 

,i-m„- 1  +  : 

Since  it  —  I  <  K.  by  Lemma  5.  From  the  al¬ 

gorithm.  + 1  -  Pn,„-  Hence. 

MT/>|9„-|  +  1.9,.]  =  km.. 

Thus. 

MT/>(I.  9„]  =  MT^[I.  ()„_,]  +  . . . - 

In  the  rest  of  the  proof,  the  following  four  cases  will  be 
handled  separately: 

Case  A:  0„  =  <J>„  and  0„_(  = 

Case  B:  0„  =  4>„  and  9„_,  =  4,„_i 

Case  C:  9„  =  4',,  and 

Case  D:  9„  =  4',,  and  9„_|  =  4'„_|. 

Case  A:  9„  =  4>„  and  9,,-)  =  1*. 

ko..-i  -/>o„.||  +  ko„  ~  <7o„-il 

=  k+.-i _  /,+..-il  +  k+« -  </+..-il 
=  k*..-, _  ‘/♦..-J  +  k*..-i  ~  <7+,. I  - 

+  <7 >!>„-,  -  </♦,.  =  /’+„-|  “  <7l>„ 

**  ki-,,-1 —  p*„ )  +  ( p*„  —  <7  ■)•„)• 

By  Lemma  5.  since  ()„  =  4>„  and  tv  -  1  <  K,  p*a  =  t/*„  and 
P*„-,  —  Hence. 

ko„_,  -  /»«„.,  |  +  ko„  ~  I 

^  k*„_i  ~  ^+„)  +  (d*„  ~  <?♦„)• 
Since  Q  is  a  valid  RFPS  for  D , 

t/+„  <  d$n  <  41^,  +•  h’. 

Hence,  (t/*,,  -  <7^)  i  0.  Thus, 

k«..-i  ~  /»«»„-! I  +  ko„  ~  <7o„-il  ^  d*„-,  ~ 

From  the  definition  of  9.  since  9„  =  4>„,  </*„.,  >  d+n. 
Hence. 

d*„.\  ~  | -  </*„|. 

Thus. 

ko„_,  -  />«»,._! I  +  k<»„  -  <7o„.il  s  k4,„_,  -  tf+j 

Therefore. 

MTy[l.9„]  >  MT/»(1, 9„_,]  +  k*„.,  -tf* J 
By  Lemma  5.  since  9„  =  4><t,  p,„n  =  t/*„.  Hence. 

MT/>[1. 9„]  =  MTMl.9I,_l]  +  k*, 
Thus,

MTe[1.0„]  >  MT/.J1. 0„]. 

Case  B:  0„  =  4>„  and  0n_,  = 

ke„-i  ~Po„-il  + 

=  k+„-l  _  /’+„-|l  +  l</4>„  ~  fl+n-ll 

=  k*„-|  —  9+n-ll  +  I?*., -I  “  94^1  —  P*n-t 

—  </♦„_!  "h  ~  ~  P*a- 1  _  Q*a- 

From  the  algorithm, />+„_,  =  p$„.v  Hence, 

ko„-,  “  Po„.,|  +  ko„  “  I  ^  P*„-\  ~  <?4v 

The  rest  of  the  proof  for  this  case  is  identical  to  the  proof  of 
Case  A. 

Case  C:  0„  =  and  0„_,  = 

koa_,  ~  Po„_, |  +  ke,  _9o„-,| 

=  k*,,-.  ~/,4>„-il  +  -  <?+„-il 

-  <7+<,-l  ~  P*a- 1  +  <7*,.  ~  <1*,- 1  =  <7  +  .  ~  P*a- 1 

=  (<?♦„  “  P*„)  +  (/>*„  ~  /’*«- 1)- 

By  Lemma  5,  since  0„  =  'I/„  and  a  -  1  <  K ,  p$a  =  d -  w 
+  1  and  /?+„_,  =  </+„_!•  Hence, 

ke,-i  ~Po»-il  +  ke.  ~  <?o„-il  ^  (q*„  ~  d*n  +  w  -  1) 

+  (f7+„  —  +  1  — 

Since  Q  is  a  valid  RFPS  for  D , 

<7*.  <  <  q+a  +  hc 

Hence,  q+a  -  d*a  +  w  >  0,  so 

q+„-  d*a  +  *  ~  \  >  0. 

Thus, 

k©«-i  ~ Pe.-il  +  l<7o„  ~  qo..,\  >d*„-w+  1  - 

From  the  definition  of  0,  since  0„  =  'I'a,  d*a  >  d$a_v  From 
the  algorithm, 

max  (£„-i)  -  min  (£„-i)  <  w 

while 

max  (£„_|  u  km„.,+  il)  -  min  (£„_,  u  1|)  >  h-. 

Hence,  either  r/m„.,+  i  <  or  dm„.y+\  >  In  this 

case,  since  </$„  >  d$„_ ,  and  rfm„.,  +  i  ^  d$n,  it  must  be  true 
thatrfmo.,+  i  By  the  definition  of  'P,  d*„  >  dm„.,+ 1. 

Hence, 

d*«  >  d*..r 

By  Lemma  4,  since  a  —  1  <  K.  </*„_,  =  d*n_ ,  +  h-  -  1, 
Hence,  d*a  >  d$a_[  +  h  -  1,  so 

d*o  ~  *  +  1  -  d*n- 1  >  0- 

Thus, 

d+.  -  *  + 1  -  d*,-i  =  k*„  -  h'  + 1  -  i . 

By  Lemma  5,  since  0„  =  p„a  -  d*„  -  *»■  +  1 ,  Hence. 
ko.-i  -/’o„-il  +  ko.“  <7o„-,|  ^  \Pma  ~  dQn-\ I* 


Therefore, 

MT<,[1.0„]  >  MT/>[1. 0«, — i ]  +  | I • 

It  has  been  shown  above  that  MT/>[1. 0„]  =  MT/>[1.  0„-i] 
+  \Pm„  ~  Hence. 

MT^[1. 0„]  >  MT,[1.0J. 

Case  D:  0„  =  'I/„  and  0„_i  =  'P„_|. 

k<>„_,  —  +  ko„  I 

=  k*,.-l  _  P'&t<-\  I  +  k*n  —  I  I 

—  —  p*,,- 1  "F  qi>„  ~  q*„-i  =  q*„  ~  p*„~ i- 

From  the  algorithm.  />+„_,  =  Hence. 

ko„-i  “  /»«„_, I  +  k«»„  “  </<>.,-, I  ^  -  /’t.n-r 

The  rest  of  the  proof  for  this  case  is  identical  to  the  proof  of 
Case  C. 
ABSTRACT

A large amount of computer time is used for the solution of systems of
linear equations in the course of circuit simulation during the design of
integrated circuits. This expenditure limits the size of circuits which can be
practically simulated, and results in poor response time in an interactive
environment. In order to increase the size of circuits which can be simulated,
and to improve the response time, one option pursued here is to apply concurrent
computation to the linear equation solution aspect of circuit simulation. This
concurrent computation will exploit inherent parallelism in the linear equation
solution to reduce the time required for that solution. We focus on one
particular method for solution of the linear equations: LU decomposition.

While  LU  decomposition  has  a  great  deal  of  inherent  parallelism,  the  wide 
range  of  sparse  matrix  structures  requires  that  this  parallelism  be  detected 
automatically.  It  has  been  determined  that  the  overall  speedup  is  sensitive  to 
the  delays  between  cooperating  computational  elements,  and  the  manner  in 
which  the  concurrent  computations  are  mapped  onto  computational  elements  is 
therefore of importance. The approach used is as follows: Given a sparse matrix
with a particular structure, a code generator produces a program representing
the  LU  decomposition  for  that  matrix.  Another  program  detects  the  precedence 
constraints  among  the  sequential  instructions  in  the  code  and  models  the  solu¬ 
tion  process  as  a  directed  graph.  Based  on  this  graph,  scheduling  techniques  are 
employed  to  assign  segments  of  code  to  computational  elements  for  concurrent 
execution. 

Most of this thesis concentrates on the last problem: finding scheduling
algorithms which reduce the sensitivity of the solution time to the
communication delay among computational elements. This is based on the following
observation. With zero delay, Hu's well-known level scheduling algorithm gives good
speedup performance. However, when the communication delay is large
compared to the execution time of an instruction in the code, considerable
degradation in speedup is observed for Hu's algorithm.

Optimal scheduling in this setting appears to be intractable, in the sense
that no polynomial-time optimal algorithm appears to exist. Hence heuristic
algorithms with feasible running time that give suboptimal schedules have to be
constructed. This is approached in two different ways. The first approach is
heuristic local minimization: scheduling algorithms using two matching
algorithms from combinatorial optimization are studied and promising results are
obtained. These two matching algorithms, min-max matching and weighted
matching, give an optimal code-to-processor assignment at each time step. The
second approach is heuristic global minimization using a clustering technique.
The critical (longest) path in the directed graph correlates closely with the
completion time. The idea is to shorten the critical path by clustering nodes
together in order to reduce the communication delay.

Experimental results are given for both approaches, using a set of sparse
linear systems taken from actual circuit simulations.
CHAPTER 1


INTRODUCTION 


1.1.  Motivation  and  the  Goal  of  this  Research 

In the domain of circuit simulation, practically all standard circuit
simulators such as SPICE [1] and ASTAP [2] contain a routine for solving a system of
linear equations. For small circuits, most of the cpu time is spent in loading
the circuit matrix and in the model evaluations. This processing time grows
linearly with the size of the circuit. For large circuits, it has been observed
that a large portion of the cpu simulation time is spent in solving the system of
linear equations. It has been estimated that this linear equation solution time
grows as $O(n^{\beta})$, where $n$ is the size of the circuit measured in the number of
circuit elements and $\beta$ is between 1.1 and 1.5. Hence for large circuits, this will
become the dominant cost of the analysis.

In order to reduce the simulation time of large circuits, decomposition of
the original problem into smaller modules for parallel computing should be
considered. These modules will be assigned to different processors for concurrent
execution. Due to the inherent parallelism in the solution algorithm of the
system of linear equations, this concurrency technique will speed up the solution
process.

In the most general case, the precedence constraints among these modules
are arbitrary. In this case, the assignment of these modules to processors to
obtain a schedule which has the shortest possible finishing time is very
complicated. Another consideration is that there exists a nonzero interprocessor
communication overhead in any realistic distributed computing environment. This
overhead will tend to degrade the speedup performance of the multiprocessing
system. Hence this overhead delay makes the already very complicated problem
even more difficult. In this dissertation, the goal is to construct some heuristic
scheduling algorithms taking into account the communication delay between
processors, so that the finishing time of a schedule is as small as possible.

1.2.  Brief  Review  of  Standard  Circuit  Simulator 

In the design of integrated circuits, circuit simulation programs such as
SPICE [1] and ASTAP [2] have been shown to give very accurate analysis of the
electrical characteristics of a circuit. With these computer-aided design tools, circuit
designers can verify and modify the circuit to get a better design with great
flexibility. This approach virtually replaces the traditional breadboard and
testing approach as a means of verifying the electrical performance of the final
circuit.


In most cases, the simulation of an electrical circuit requires three
basic types of analyses: the dc analysis, the small-signal ac analysis and the
transient analysis in the time domain. In the most general case of transient
analysis, the dynamical behaviour of the circuit is described by a system of
differential equations

$$f(\dot{x}(t), x(t), u(t)) = 0; \quad x(0) = x_0 \qquad (1.1)$$

where $x(t) \in \mathbb{R}^{n}$ is the vector of unknown circuit variables, $u(t) \in \mathbb{R}^{r}$ is the vector
of input circuit variables and their time derivatives, $x_0 \in \mathbb{R}^{n}$ is the given initial
value of $x$, and $f$ is a vector-valued continuous function. The simulation is
carried out through a sequence of discrete timesteps $t_i$; $i = 1, 2, \ldots, N$ chosen by
the simulator with $t_0 = 0$ and $t_N = T$ where $T$ is the simulation time specified by
the user.

The three basic procedures used to solve this system over the time interval
$[0, T]$ are: an implicit numerical integration scheme, the Newton-Raphson
algorithm and the solution of the resulting system of linear equations. The
differential equations describing the reactive elements are replaced by their
corresponding discrete circuit models associated with an implicit integration
algorithm such as Backward Euler

$$\dot{x}(t_i) \approx \frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}} \qquad (1.2)$$

or Trapezoidal rule

$$\dot{x}(t_i) \approx 2\,\frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}} - \dot{x}(t_{i-1}) \qquad (1.3)$$


At this stage, we have a resistive network consisting of linear and/or nonlinear
elements. The nonlinear resistive elements are then replaced by their
companion models using the Newton-Raphson algorithm,

$$g(x^{k+1}) \approx g(x^{k}) + \left.\frac{\partial g}{\partial x}\right|_{x^{k}} (x^{k+1} - x^{k}) \qquad (1.4)$$

where $g(\cdot)$ is the branch relation describing the nonlinear element, $\partial g / \partial x$ is
the partial derivative of $g(\cdot)$ evaluated at $x^{k}$, and $k$ is the Newton-Raphson
iteration count. The resulting network contains only linear resistive elements.
Modified nodal analysis (or Sparse Tableau) is used to assemble the linear circuit
equations and the solution of these equations is sought by LU decomposition
followed by forward and backward substitutions (Gaussian elimination). The
coefficient matrix obtained is usually very sparse and, for efficient solution of the
linear circuit equations, sparse matrix techniques are employed [3]. These
procedures are repeated for each time step until the whole time interval of
simulation is completed. A more detailed analysis and description of the above
algorithms can be found in [4]. The time-domain transient analysis is
summarized in the flow chart shown in Figure 1.1.
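
The nesting of the three procedures (implicit integration, Newton-Raphson iteration, linear solution) can be illustrated on a one-node circuit. The sketch below is an assumption-laden toy and not code from SPICE or ASTAP: the circuit (a current source driving a capacitor, a resistor and a diode in parallel), the function name `transient`, and all component values are hypothetical. With a single unknown, the "linear system" at each Newton step reduces to one scalar division.

```python
import math


def transient(u, T, steps, C=1e-6, R=1e3, Is=1e-12, Vt=0.025,
              tol=1e-12, max_iter=50):
    """Backward Euler + Newton-Raphson for C*dv/dt + g(v) = u(t), v(0) = 0.

    g(v) = v/R + Is*(exp(v/Vt) - 1) is the (assumed) nonlinear branch relation.
    Returns a list of (t, v) pairs.
    """
    h = T / steps
    v = 0.0
    result = [(0.0, v)]
    for i in range(1, steps + 1):
        t = i * h
        v_prev = v
        vk = v_prev                      # initial Newton guess: previous solution
        for _ in range(max_iter):
            g = vk / R + Is * math.expm1(vk / Vt)
            dg = 1.0 / R + (Is / Vt) * math.exp(vk / Vt)
            # Backward Euler replaces dv/dt by (vk - v_prev) / h.
            residual = C * (vk - v_prev) / h + g - u(t)
            jacobian = C / h + dg
            dv = -residual / jacobian    # 1x1 linear solve
            vk += dv
            if abs(dv) < tol:
                break
        v = vk
        result.append((t, v))
    return result
```

Calling `transient(lambda t: 1e-3, T=10e-3, steps=1000)` steps the node voltage toward the operating point at which the resistor and diode currents together balance the 1 mA source. In a full simulator the scalar division becomes the sparse LU solution discussed in the text.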


It has been observed in [5] that the bulk of the storage and computation of
the simulation lie in loading the modified nodal analysis matrix and solving the
linearized circuit equations. In particular, for very large circuits, the
computation time spent in the solution of the system of linear circuit equations grows
superlinearly as $n^{k}$, where $k$ is between 1.1 and 1.5 and $n$ is the size of the circuit
measured in terms of circuit components. Hence, for cost-effective use of the
simulator, the simulation is usually limited to circuits of a few hundred devices.
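
To make this growth rate concrete, the following worked arithmetic (an illustration, not a measurement from the cited study) shows how the linear-solution time $T(n)$ scales when the circuit size is doubled or increased tenfold, assuming $T(n)$ is proportional to $n^{k}$ with $k$ at the two ends of the quoted range.

```latex
\frac{T(2n)}{T(n)} = 2^{k}: \qquad 2^{1.1} \approx 2.14, \qquad 2^{1.5} \approx 2.83
\frac{T(10n)}{T(n)} = 10^{k}: \qquad 10^{1.1} \approx 12.6, \qquad 10^{1.5} \approx 31.6
```

By comparison, a cost that grows linearly with circuit size would increase by factors of exactly 2 and 10, which is why the linear solution comes to dominate for large circuits.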


1.3.  Impact  of  Parallel  Processing  on  Circuit  Simulation 


As a result of improvements in fabrication and processing technologies,
we are moving into the era of very large scale integrated (VLSI) circuits. The
natural consequence of this is the increase in demand for simulating these VLSI
circuits. It is not unusual to find large mainframe computers dedicated solely to
circuit simulation in some major integrated circuit design houses.


1.3.1.  Decomposition 


As already pointed out in the last paragraph of section 1.2, it is not
cost-effective to simulate the whole circuit on a single computer. Hence, new
simulation techniques are necessary in order to cope with the problems suffered in the
standard circuit simulators. A survey of the third generation simulator
algorithms is found in [6]. These algorithms are based on the concept of
decomposition of large-scale systems. Within this context, decomposition refers to the
partitioning of the problem of solving a system of equations into many
subproblems. Each subproblem consists of a subset of the original equations and the
corresponding variables. The solution of the original problem is obtained by
considering the interactions between these subproblems. Some of the well known
decomposition algorithms are Block LU Factorization, the Tearing Algorithm, the
Multilevel Newton-Raphson Algorithm and the Relaxation Algorithm (see [7] and [8]).
In this dissertation, we have another decomposition technique which is different
from the traditional decomposition algorithms mentioned above. We call it
"elemental" decomposition [9], which exploits the parallelism among the elemental
operations representing a particular algorithm. Here the decomposition is done
on the individual operations.
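
As an illustration of decomposition at the level of individual operations, the sketch below lists the elemental operations of an LU decomposition (without pivoting) for a given sparsity pattern and assigns each operation a precedence level. It is a simplified sketch and not the code generator or dependency analyzer of this dissertation: the names `lu_operations` and `precedence_levels` are hypothetical, only read-after-write dependencies are tracked, and every operation is assumed to take one time unit with zero communication delay.

```python
def lu_operations(n, nonzeros):
    """List the elemental operations of LU decomposition for a sparsity pattern.

    nonzeros: iterable of (row, col) positions that are structurally nonzero.
    Each operation is (kind, target, reads), where target is the matrix
    position written and reads are the positions read.
    """
    nz = set(nonzeros)
    ops = []
    for k in range(n):
        for i in range(k + 1, n):
            if (i, k) not in nz:
                continue
            # a[i][k] = a[i][k] / a[k][k]
            ops.append(("div", (i, k), [(i, k), (k, k)]))
            for j in range(k + 1, n):
                if (k, j) in nz:
                    # a[i][j] = a[i][j] - a[i][k] * a[k][j]
                    ops.append(("upd", (i, j), [(i, j), (i, k), (k, j)]))
                    nz.add((i, j))       # possible fill-in
    return ops


def precedence_levels(ops):
    """Level of each operation in the dependency graph (0 = no predecessors).

    An operation depends on the most recent earlier operation that wrote
    any position it reads.
    """
    last_writer = {}
    levels = []
    for index, (_, target, reads) in enumerate(ops):
        deps = [last_writer[pos] for pos in reads if pos in last_writer]
        levels.append(1 + max((levels[d] for d in deps), default=-1))
        last_writer[target] = index
    return levels
```

Operations that share a level have no precedence constraints among them and could be issued concurrently; the number of distinct levels is the length of the critical path under the stated unit-time assumption. For a dense 3 by 3 matrix the eight operations fall into four levels, whereas a tridiagonal 3 by 3 pattern yields four operations in a single chain.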


1.3.2.  Advantages  of  Decomposition 


With the advances in VLSI technology and the price-performance
improvements in hardware, we can expect to be able to perform
computationally intensive computation on a multiprocessor computing system in the near
future. The realization of the potential offered by these special concurrent
computer architectures lies in the development of new parallel computational
algorithms such as the ones mentioned above. No matter which decomposition
technique is employed, the most important advantage of decomposition is the
capability to use concurrency to increase the size of the circuit to be simulated. We
envision a multiprocessor system consisting of a set of processors and an
interconnection network. Each processor has its own memory and the processors can only
communicate through a central interconnection network. In other words, all the
communication functions such as routing, message forwarding and buffer
management are done by the nodes within the interconnection network. A conceptual
model of a multiprocessor system is shown in Figure 1.2. The advantages of
decomposition in the context of parallel processing are as follows:


High  Speed  Capability 


The subproblems or modules of a decomposed system can run on processors
which work cooperatively to gain high speed through concurrent execution, and
hence the response time is reduced. An additional gain in speed would result
from the smaller memory addressed within each processor [10].


Memory  Size 


As pointed out in section 1.2, the storage required for a simulation of VLSI
circuits is very large, and hence it is not practical to simulate the whole
circuit on a single computer. Partitioning the original circuit into modules
that are treated on separate processors permits a much larger circuit to be
simulated.

Latency

One of the advantages of a distributed computing system is the ability to share
resources available at various computing components. By exploiting the latency
of the circuit, i.e. avoiding the expense of simulating parts of the circuit which
are not active at a given point in the simulation time, we can redirect some of
the available computing resources to other useful purposes.

This divide-and-conquer approach is not only appealing for circuit simulation,
but it also has applications in areas such as real-time command and control,
database management and real-time signal processing.

1.4.  Simulation  Tool 

In order to evaluate the speedup performance as well as the efficiency of a
multiprocessor system when a particular parallel algorithm is executing on it,
either a hardwired multiprocessor or a simulator is necessary. Since we do not
have a multiprocessor, a discrete-time, event-driven simulation program has
been written ([11] and [12]). This simulator is named SIMON (Simulator of
Multiprocessor Networks); it simulates the parallel execution of a set of user
programs as if each program were run on a separate processor.

1.4.1. The Simulator, SIMON

The simulator consists of three components, as shown in Figure 1.3: the
application programs or, equivalently, tasks (processes), the simulator base, and
the switch model. The application programs are assumed to be written in C for
easy interfacing with SIMON. The simulator base time-multiplexes the execution
of the processes on a host computer, which is a VAX 11/780. The base also keeps
track of the time for each task to ensure that the interactions among these
tasks are simulated in the correct time sequence. The switch model models the
interconnection network. Keeping the switch model as a separate module
facilitates the comparison of different switching structures.

1.4.2.  Important  Features  of  the  Simulator 

The most important features of the simulator are summarized below:

(1) provides statistics such as blocked time and running time on each processor.

(2) gives average traffic measured in terms of the number of bytes that have
been exchanged between each pair of processors.

(3) allows the user to model the speed of the processor by specifying the
execution times of assembler instructions (comparable to those available in
the VAX).

(4) permits the user to specify a constant message delay between any
pair of communicating processors.

These features provide crucial information about the speedup performance of a
parallel algorithm on a switching structure, so that the user can tailor the
topology of the interconnection network and design an appropriate message
routing algorithm for specific applications such as circuit simulation and
signal processing.

1.5.  Problem  Statement 

With the advantages of concurrent processing applied to the parallel
algorithms outlined above, we attempt to implement the LU decomposition on a
multiprocessing system, taking into consideration the communication delay in
the switching network. The main objective is to achieve a reduction in execution
time when we use a multiprocessor approach as compared to a single processor.
This time reduction is measured in terms of speedup performance, which is
defined as the ratio of the completion time of a given task using a single
processor to the completion time using multiple processors. In this
dissertation, the communication delay is defined as the time elapsed between the
moment a processor sends a message and the moment the message arrives at the
destination processor. A further simplification is that a constant delay is
assumed between any pair of communicating processors in the multiprocessor
system.

Previous work [9] showed a promising speedup performance when the
communication delay is ignored and a scheduling algorithm called Hu's level
scheduling algorithm is employed. However, when the communication delay is
large, it will be shown that there is a considerable degradation in the speedup
performance. Hence the problem of assigning tasks to processors, taking the
delay into account, so as to minimize the completion time is the main concern of
this research. This is a classical problem of scheduling theory in resource
management. In the context of this problem, finding an optimal schedule appears
to be intractable, in the sense that no polynomial-time optimal scheduling
algorithm is available. Heuristic scheduling algorithms with feasible running
time that give suboptimal schedules therefore have to be constructed.
The performance of these heuristics, in terms of the speedup ratio obtained,
will be compared to that of Hu's level scheduling algorithm in the presence of
communication delay in the interconnection network.

1.6.  Outline  of  the  Dissertation 

In the following chapters, the approach to this research and the heuristic
scheduling algorithms will be discussed in detail, followed by a concluding
chapter of remarks and future research. More precisely, chapter two is devoted
to LU decomposition and our approach to implementing the decomposition on a
multiprocessor system. Chapter three gives a detailed description of Hu's
level scheduling algorithm applied to the LU decomposition task graph. The
simulation studies of Hu's level scheduling technique in the presence of
communication delay in the switching network are then discussed. In chapters
four and five, heuristic algorithms that take the communication delay into
account are presented. The heuristic local minimization algorithms using the
min-max matching and weighted matching algorithms are discussed in chapter
four, and the heuristic global minimization algorithm is detailed in chapter
five. In both chapters, simulation results are presented to show the
performance improvement of these heuristics over Hu's level scheduling
algorithm in the presence of communication delay.


CHAPTER  2 


PARTITION  OF  LU  DECOMPOSITION  FOR 
CONCURRENT  PROCESSING 

2.1.  Introduction 

Practically all circuit simulators contain a routine for solving a system of
linear equations of the form

    Ax = b                                                          (2.1)

where A is the coefficient matrix, x is the unknown vector of circuit variables
and b is the excitation vector. In circuit simulation as well as in other
engineering applications, the coefficient matrix is very sparse. This inherent
sparseness of the matrix must be recognized in order to achieve an efficient
solution for the unknown vector x.

There are basically two methods of solving (2.1). The first is the iterative
method, the most common iterative method being the Gauss-Seidel method. This
technique starts from an initial guess of the solution and then updates the
solution vector using the previous iterate until convergence is obtained. A
detailed analysis of the convergence properties and roundoff error can be found
in [7]. The second is the direct method, which solves (2.1) by multiplying both
sides of the matrix equation by A^{-1} to yield

    x = A^{-1} b                                                    (2.2)

provided that A^{-1} exists. One could then invert A directly by numerical
means. However, an operation count of the number of multiplications and
divisions required to find the inverse of A shows that this requires three
times the computational effort of the other direct methods. So the method of
finding the inverse of A in solving (2.1) is seldom used in simulation
programs. There are two equivalent methods of direct elimination which do not
require finding the inverse of A: Gaussian elimination and LU decomposition.
They are computationally equivalent in the sense that they require the same
number of operations. However, for multiple input vectors b, LU decomposition
is done only once and the required solutions can be found subsequently by
forward and backward substitutions. This method will be discussed in more
detail in the next section. For Gaussian elimination, the solution process has
to be repeated for each input vector b. Hence LU decomposition is the sensible
choice.

2.2.  Doolittle  Algorithm  for  LU  Decomposition 

LU decomposition factors, or decomposes, the original coefficient matrix A
into the product of a lower triangular matrix L and an upper triangular matrix
U. The diagonal terms of either L or U are set to unity, depending on how the
decomposition is carried out. There are several equivalent methods for LU
decomposition, that is, for finding L and U. In the discussion that follows, we
present the Doolittle algorithm [13]. The diagonal elements of L in this case
are set to unity.

By substituting A = LU into the original equation (2.1), we obtain

    Ax = LUx = b                                                    (2.3)

Let y = Ux; then the original system of linear equations is equivalent to two
derived systems of linear equations

    Ly = b                                                          (2.4)

    Ux = y                                                          (2.5)

The method for finding the L and U matrices will be discussed in section 2.2.2.

2.2.1.  Forward  and  Backward  Substitutions 

After the factorization, the solution vector x can be obtained by two
successive substitutions: the forward substitution followed by the backward
substitution. From (2.4), y can be solved easily because L is a lower
triangular matrix, by the following recursive relation

    y_1 = b_1                                                       (2.6)

    y_j = b_j - \sum_{i<j} l_{ji} y_i ;    j = 2, 3, ..., N

The  above  process  is  called  forward  substitution.  After  solving  for  y  and  from 
(2.5),  x  can  also  be  solved  easily  by  the  following  recursive  relation 


    x_N = y_N / u_{NN}                                              (2.7)

    x_j = ( y_j - \sum_{i=j+1}^{N} u_{ji} x_i ) / u_{jj} ;    j = N-1, N-2, ..., 1


This is called backward substitution. Note that in both forward and backward
substitutions, neither the inverse of L nor the inverse of U is necessary in
finding the solution of (2.1).
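
The two recursions can be sketched in C as follows. This is only a minimal
illustration, assuming dense storage and a fixed size N = 3 chosen for the
example; the function names forward_subst and backward_subst are hypothetical
and are not taken from any simulator discussed here.

```c
#include <stddef.h>

#define N 3   /* illustrative system size (assumption) */

/* Forward substitution: solve L y = b, where L is lower triangular
 * with unit diagonal (Doolittle convention). */
void forward_subst(double l[N][N], const double b[N], double y[N])
{
    for (size_t j = 0; j < N; j++) {
        double sum = b[j];
        for (size_t i = 0; i < j; i++)
            sum -= l[j][i] * y[i];
        y[j] = sum;               /* l[j][j] is 1, so no division */
    }
}

/* Backward substitution: solve U x = y, where U is upper triangular
 * with a nonzero diagonal. */
void backward_subst(double u[N][N], const double y[N], double x[N])
{
    for (size_t j = N; j-- > 0; ) {
        double sum = y[j];
        for (size_t i = j + 1; i < N; i++)
            sum -= u[j][i] * x[i];
        x[j] = sum / u[j][j];
    }
}
```

For instance, with L = [1 0 0; 3 1 0; 2 3 1], U = [2 1 3; 0 3 1; 0 0 -2] and
b = (6, 22, 22), forward substitution gives y = (6, 4, -2) and backward
substitution gives x = (1, 1, 1).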

2.2.2.  Doolittle  Algorithm 

The Doolittle algorithm discussed here consists of two alternating steps. The
first is the dividing step and it is followed by the updating step. At each
computation step of the algorithm, the dividing step is performed on the
elements in the pivoting column and the updating step is performed on all the
elements whose row index and column index are greater than the corresponding
pivot row index and pivot column index. The details of the algorithm are as
follows:

Given a matrix of size N

Initialization: k = 1; sub-matrix = original matrix;

Step 1:

In column k of the sub-matrix, divide all nonzero elements in this column
by a_{kk} (dividing step)

    a_{jk} = a_{jk} / a_{kk} ;    j = k+1, k+2, ..., N               (2.8)

Step 2:

For each nonzero element a_{ij} in the sub-matrix obtained by deleting the
first k rows and k columns, subtract from it the product of a_{ik} and a_{kj}
(updating step)

    a_{ij} = a_{ij} - a_{ik} a_{kj} ;    i, j = k+1, k+2, ..., N     (2.9)

Step 3:

k = k + 1; if k = N stop. Otherwise go to Step 1.


The method described above gives an in-place LU factorization. The elements
of L are obtained from the resultant matrix A as follows:

    l_{ij} = a_{ij}   if i > j ;
    l_{ij} = 1        if i = j ;
    l_{ij} = 0        if i < j ;

and the elements of U are obtained as follows:

    u_{ij} = 0        if i > j ;
    u_{ij} = a_{ij}   if i <= j ;


Before leaving this algorithm and giving a simple example, a few comments
are worth making for the case when the given matrix is very sparse. The first
comment is that the trivial operation of multiplying a number by zero should be
avoided. In updating an element a_{ij} in the sub-matrix (see equation (2.9)),
if either a_{ik} or a_{kj} is zero, then this updating operation is not
necessary.

The second comment relates to the generation of fill-in elements. A fill-in
element is an element which originally has a value equal to zero but assumes a
nonzero value during the LU decomposition. This happens in the updating step
when a_{ij} is zero and neither a_{ik} nor a_{kj} is zero. Because fill-in
elements propagate, i.e. fill-in elements generate more fill-in elements, they
decrease the sparseness of the original coefficient matrix. Various reordering
algorithms such as Markowitz, Berry and Nahkla [14] are implemented in the
simulation programs to minimize the generation of these fill-in elements.
Throughout this dissertation, we assume that this reordering has been done
before performing the Doolittle algorithm.
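
The fill-in mechanism can be illustrated with a symbolic pass over the nonzero
pattern alone. The sketch below is an assumption-laden illustration: it uses a
small dense boolean pattern of size DIM = 4 rather than a sparse data
structure, and the name count_fill_ins is hypothetical.

```c
#include <stdbool.h>

#define DIM 4   /* illustrative matrix size (assumption) */

/* Symbolic elimination on a nonzero pattern: nz[i][j] is true where the
 * matrix has a structural nonzero. Fill-ins are marked in place and the
 * number of fill-ins created is returned. A nonzero diagonal is assumed
 * (no pivoting). */
int count_fill_ins(bool nz[DIM][DIM])
{
    int fill = 0;
    for (int k = 0; k < DIM - 1; k++) {
        for (int i = k + 1; i < DIM; i++) {
            if (!nz[i][k])
                continue;               /* a_ik is zero: no update */
            for (int j = k + 1; j < DIM; j++) {
                if (!nz[k][j])
                    continue;           /* a_kj is zero: no update */
                if (!nz[i][j]) {        /* a zero becomes nonzero */
                    nz[i][j] = true;
                    fill++;
                }
            }
        }
    }
    return fill;
}
```

With a pattern whose first row, first column and diagonal are nonzero, this
pass reports six fill-ins; if the same pattern is reordered so that the full
row and column come last, it reports none, which is the effect the reordering
algorithms aim for.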

Another practical aspect of the Doolittle algorithm is pivoting for accuracy.
Pivoting is the interchanging of rows and columns so that the element with
the largest magnitude is placed in the pivot position along the diagonal of the
matrix. This reduces the roundoff error introduced in the dividing step.
Pivoting will not be carried out when the LU decomposition is performed here.
We will address this point at the end of section 2.3.

In  order  to  understand  the  Doolittle  algorithm  in  more  detail,  the  following 
simple  example  will  serve  this  purpose.  Let 

        2   1   3
    A = 6   6  10
        4  11   7

Steps 1 and 2 with k = 1 produce the following matrix

        2   1   3
        3   3   1
        2   9   1

Then steps 1 and 2 with k = 2 give the final matrix

        2   1   3
        3   3   1
        2   3  -2

and the corresponding two factors can be written down immediately

        1  0  0            2  1   3
    L = 3  1  0        U = 0  3   1
        2  3  1            0  0  -2
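
The steps of the algorithm and the example above can be checked with a short
dense implementation. This is a sketch under stated assumptions: the matrix is
stored as a full N x N array with N = 3 (a real simulator would use a sparse
structure), no pivoting is done, and the name doolittle_inplace is
hypothetical.

```c
#define N 3   /* illustrative matrix size (assumption) */

/* In-place LU factorization following Steps 1 to 3 (no pivoting).
 * On return, the strict lower triangle of a holds L (unit diagonal
 * implied) and the upper triangle, including the diagonal, holds U.
 * Returns 0 on success, -1 if a zero pivot is met. */
int doolittle_inplace(double a[N][N])
{
    for (int k = 0; k < N - 1; k++) {
        if (a[k][k] == 0.0)
            return -1;
        for (int j = k + 1; j < N; j++) {       /* dividing step (2.8) */
            if (a[j][k] != 0.0)
                a[j][k] /= a[k][k];
        }
        for (int i = k + 1; i < N; i++) {       /* updating step (2.9) */
            if (a[i][k] == 0.0)
                continue;                        /* skip multiply by zero */
            for (int j = k + 1; j < N; j++) {
                if (a[k][j] != 0.0)
                    a[i][j] -= a[i][k] * a[k][j];
            }
        }
    }
    return 0;
}
```

Applied to the 3 x 3 example matrix, the routine overwrites it with the rows
(2, 1, 3), (3, 3, 1) and (2, 3, -2), from which L and U are read off as
described.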

2.3. Code Generation




Since the routine to solve the system of linear equations is called many
times during the whole time interval of the simulation, the idea of generating
machine code for solving the system of linear equations has been implemented
in SPICE [1]. For dc and transient analysis, where the system of equations is
real, a nonlooping set of machine code instructions specific to a particular
circuit is generated, which executes faster than a FORTRAN routine. Since much
of the cpu time is spent in addressing when executing a loop, it is beneficial
to generate non-looping code in order to gain execution speed. The decrease in
cpu time is well worth the increase in memory for storing the additional code.
This idea is well suited to the Doolittle algorithm because it is basically a
loop.
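
As a hedged illustration of what non-looping code looks like, the sketch below
writes out the Doolittle operations for a dense 3 x 3 matrix as straight-line C.
It is not the machine code that SPICE generates: the struct mat3, the function
name lu3_straight_line, the scalar names of the form ax_y (element in row x,
column y) and the order of the statements are all assumptions made for the
example.

```c
/* One scalar per matrix element; ax_y is the element in row x, column y. */
struct mat3 {
    double a1_1, a1_2, a1_3;
    double a2_1, a2_2, a2_3;
    double a3_1, a3_2, a3_3;
};

/* In-place LU of a dense 3 x 3 matrix with no loops and no index
 * arithmetic: every operand is named directly. */
void lu3_straight_line(struct mat3 *m)
{
    /* k = 1: divide column 1, then update the trailing 2 x 2 block */
    m->a2_1 = m->a2_1 / m->a1_1;
    m->a3_1 = m->a3_1 / m->a1_1;
    m->a2_2 = m->a2_2 - m->a2_1 * m->a1_2;
    m->a2_3 = m->a2_3 - m->a2_1 * m->a1_3;
    m->a3_2 = m->a3_2 - m->a3_1 * m->a1_2;
    m->a3_3 = m->a3_3 - m->a3_1 * m->a1_3;
    /* k = 2: divide column 2, then update the last element */
    m->a3_2 = m->a3_2 / m->a2_2;
    m->a3_3 = m->a3_3 - m->a3_2 * m->a2_3;
}
```

The eight statements replace the nested loops of the algorithm; for a sparse
matrix, the statements whose operands are structurally zero would simply not be
emitted.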

Another advantage is that the structure of the coefficient matrix for a given
circuit is fixed. By structure we mean the locations of the nonzero elements in
the sparse coefficient matrix. Hence the operations (divide and update) on the
nonzero elements are pre-determined for a given circuit throughout the LU
decomposition. In other words, once the sequential code of the divide and
update operations of the LU decomposition is generated, it can be used for the
entire circuit simulation involving many LU decompositions. This is called
symbolic LU factorization. In the following sections, the code generator is
described and an example is given to illustrate this concept.

2.3.1.  The  Code  Generator 

A computer program, the code generator, was written to generate the
sequential instructions of divide and update operations that carry out the LU
decomposition. The code is generated in a high-level language, which is C [15]
in this case. The code generator essentially goes through the Doolittle
algorithm for a given sparse matrix stored in the two-dimensional linked list
data structure of [1], writes down the necessary operations, and adds fill-in
elements to the linked list until the entire decomposition process is finished.
One of the advantages offered by C is the easy manipulation of files through
its standard I/O capabilities. A file is opened and all the sequential
operations are stored in that file. Each line of code in the file is identified
by a number. These numbers facilitate the detection of parallelism among the
operations, which will be discussed in the next section. In the actual
implementation, these numbers are the indices to an array of pointers pointing
to the corresponding operations.
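
A much simplified sketch of such a generator is given below. It is an
illustration only and differs from the program described here in several
assumed ways: the pattern is a small dense boolean array instead of a
two-dimensional linked list, the operation number is written as a C comment in
front of each statement, and the names generate_lu_code and DIM are
hypothetical. The form of the emitted update statement is likewise an
assumption.

```c
#include <stdbool.h>
#include <stdio.h>

#define DIM 3   /* illustrative matrix size (assumption) */

/* Walk the Doolittle algorithm over a nonzero pattern and write one
 * numbered C statement per divide or update operation to out.
 * Fill-ins are added to the pattern as they arise.
 * Returns the number of operations emitted. */
int generate_lu_code(bool nz[DIM][DIM], FILE *out)
{
    int op = 0;
    for (int k = 0; k < DIM - 1; k++) {
        for (int j = k + 1; j < DIM; j++) {          /* divide operations */
            if (nz[j][k])
                fprintf(out, "/* %d */ a%d_%d = a%d_%d / a%d_%d;\n",
                        ++op, j + 1, k + 1, j + 1, k + 1, k + 1, k + 1);
        }
        for (int i = k + 1; i < DIM; i++) {          /* update operations */
            for (int j = k + 1; j < DIM; j++) {
                if (nz[i][k] && nz[k][j]) {
                    nz[i][j] = true;                 /* may be a fill-in */
                    fprintf(out,
                            "/* %d */ a%d_%d = a%d_%d - a%d_%d * a%d_%d;\n",
                            ++op, i + 1, j + 1, i + 1, j + 1,
                            i + 1, k + 1, k + 1, j + 1);
                }
            }
        }
    }
    return op;
}
```

For a 3 x 3 tridiagonal pattern this emits four statements, the first being
`a2_1 = a2_1 / a1_1;`. A fill-in element must be initialized to zero before the
generated code is run, since its first update reads it.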


2.3.2.  Simple  Example  of  the  Code  Generated 

With the description of the concept of code generation in the previous
section, perhaps a simple example of a file containing the code generated by
the code generator will make the idea more clear and concrete. Suppose we are
given the following "sparse" matrix A = [a_{ij}]


X  XX 


X  XX, 


where the x's are the locations of the nonzero elements. After the Doolittle
algorithm, the resultant matrix will be


xx  x  d 

X  X 

X  X  X 
X  X 


where the d's are the locations of the fill-in elements. The code generated for
the above example is shown in Figure 2.1. The ax_y symbol in the code denotes
the element in row x and column y. Note that the code generated does not
contain any loops and hence no addressing is necessary. The code represents
sequential computations only on the nonzero elements of the matrix. So by
compiling the above code and running it on a single processor, we will obtain
the in-place LU decomposition of the given matrix.

Let us return to the problem of pivoting for accuracy. Pivoting is difficult
when using code generation since some of the variable names ax_y will be
changed when the swapping of rows and columns is done in pivoting. If pivoting
is necessary, then the reordering of the pivots is performed and the whole
process of code generation has to be repeated.

2.4.  Graph  Model  for  LU  Decomposition 

The sequence of instructions of divide and update operations generated by
the code generator represents the in-place LU decomposition process. In this
section, we will represent the LU decomposition process by a directed graph.
Based on this directed graph, scheduling algorithms can be applied to assign the
nodes (which represent the divide or update operations) to different
processors for concurrent execution.

Definition 

A directed graph G = (N, E, <) is a graph which consists of a set of nodes N
and a set of edges E. The precedence constraint between a pair of nodes i and j
is denoted by i < j, which means that node i is the immediate predecessor of
node j and node j is the immediate successor of node i. If nodes i and j
represent two tasks (processes), then node i has to be completed before node j
can begin its execution.

In the following sub-sections, we will go through the steps for obtaining the
directed graph from the code generated in more detail. Based on this graph, we
then ask which scheduling techniques can be used to reduce the execution
time of the LU decomposition of a given matrix. Scheduling techniques refer to
the problem of assigning nodes in the directed graph to processors for
concurrent execution. A brief discussion of the extension to other
algorithms and a few simple illustrative examples are given at the end of this
chapter.

2.4.1.  Detection  of  Precedence  Constraints 

A careful study of the operations in the code generated reveals that it is not
necessary to execute the code in the order in which the operations are
generated. In other words, some operations can be executed simultaneously. The
main objective of this sub-section is to detect which operations can
be done in parallel and which operations have to be completed before the other
operations can begin their execution. Before we go on, let us define two simple
terms. The input variables are those variables appearing on the right hand side
of the assignment statement of an operation in the code. The output variable is
the variable appearing on the left hand side of the assignment statement. Notice
that in this particular LU decomposition application, there are only two
different operations, namely the divide and update operations. In the divide
operation, there are two input variables. In the update operation, there are at
most three input variables. In both cases, there is only one output variable.

The  idea  of  detecting  which  operation  precedes  the  other  is  very  simple.  If 
the  input  variables  of  an  operation  are  the  output  variables  of  some  previous 
operations,  then  there  are  precedence  constraints  between  these  operations. 
Otherwise,  that  particular  operation  can  be  initiated  immediately. 

2.4.2.  Implementation  of  Detection  of  Precedence  Constraints 

In order to implement the above detection procedure efficiently, a method
having the following two properties should be employed:

(1) It will identify each output variable uniquely.

(2) It can search all the output variables as efficiently as possible.

The idea of hashing[18] seems to fit the above two requirements, and a hash
function f( ) is needed in this case. In the simplest terms, a hash function
takes a key K (e.g. a string of characters) as input and produces a number
called the hash code, f(K), as output. It basically breaks off some aspects of
the key and uses this information as the basis for searching.


There are cases where two distinct keys K_i ≠ K_j hash to the same hash code,
f(K_i) = f(K_j). Such an occurrence is called a collision. Functions which
avoid duplicated hashed values are rare, but some of the sophisticated hash
functions have a very low probability of collision. If a collision has
occurred, one possible remedy is a technique called separate chaining [16]. It
essentially maintains several linked lists. The number of linked lists depends
on the largest hash code from these keys. All the keys having the same hash
code are linked together. The linked list contains the appropriate information
about all these keys.

In C, there is a convenient way of storing certain information about an
object. It is called a structure (in Pascal, it is called a record). A
structure can have many fields associated with the object. In this particular
application, each key has a structure with two fields: the character string of
the key and a pointer pointing to the next key with the same hash code. The
following simple example illustrates this chaining scheme. Suppose there are
five keys,

    K = AB, CAA, BT, TE, ZXWP

having the following respective hash codes:

    f(K) = 3, 1, 4, 1, 6

Note that CAA and TE have the same hash code. The separate chaining
scheme is shown in Figure 2.2. In this figure, a pointer pointing to NULL
signifies the end of the linked list.
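
A minimal sketch of separate chaining in C is given below. The hash function
used here (the sum of the character codes modulo a table size of 7) is an
assumption chosen for illustration; it is not the function that produces the
hash codes quoted above, and the names key_node, hash_key, chain_find and
chain_insert are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 7   /* illustrative number of chains (assumption) */

struct key_node {
    char name[50];             /* character string of the key */
    struct key_node *next;     /* next key with the same hash code */
};

/* Illustrative hash function: sum of character codes modulo TABLE_SIZE. */
unsigned hash_key(const char *s)
{
    unsigned h = 0;
    while (*s)
        h += (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Return the node holding key, or NULL if it is not in the table. */
struct key_node *chain_find(struct key_node *table[TABLE_SIZE], const char *key)
{
    for (struct key_node *p = table[hash_key(key)]; p != NULL; p = p->next)
        if (strcmp(p->name, key) == 0)
            return p;
    return NULL;
}

/* Append key to the end of its chain unless it is already present.
 * Returns the node, or NULL if memory allocation fails. */
struct key_node *chain_insert(struct key_node *table[TABLE_SIZE], const char *key)
{
    struct key_node **link = &table[hash_key(key)];
    while (*link != NULL) {
        if (strcmp((*link)->name, key) == 0)
            return *link;
        link = &(*link)->next;
    }
    struct key_node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    strncpy(n->name, key, sizeof n->name - 1);
    n->name[sizeof n->name - 1] = '\0';
    n->next = NULL;
    *link = n;
    return n;
}
```

The table is an array of chain heads indexed by hash code, initialized to NULL.
Two keys that collide, such as AB and BA under this illustrative hash function,
end up on the same chain and are told apart by comparing their character
strings.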

The  hash  code  is  the  index  of  a  pointer  array  pointing  to  these  keys.  The 
problem  of  resolving  the  collision  by  this  scheme  will  be  discussed  in  the  next 
sub-section  when  we  explain  how  to  get  the  LU  graph  model. 

2.4.3.  The  LU  Graph  Model 

In this section, the procedure for obtaining the LU graph model from the
code generated is described. It involves the detection of the precedence
constraints among the operations in the code. At the end of this section, we
will go through one step of the procedure to illustrate the detection process.


Based on the above discussion of the hashing technique, we can consider
each output variable in the code as a key in the separate chaining scheme. The
structure of an output variable has a field containing the character string of
the output variable and a field containing a pointer pointing to the next output
variable with the identical hash code. Recall from the end of section 2.3.1 that
each operation in the code generated is identified by a number. This number is
the index to an array of pointers pointing to that operation. It also gives the
position of the operation (as well as the output variable associated with that
operation) with respect to the other operations. Besides the two
fields mentioned above, the structure of an output variable needs an additional
field indicating the latest position of the output variable. The reason is that
an output variable can be subjected to several divide and update operations
throughout the decomposition. This output variable will then appear at several
locations in the code. Thus the latest position of the output variable should be
used in order to have the correct precedence constraints among the operations.
This will become clearer in the subsequent discussion. The structure of an
output variable for the purpose of detecting the precedence constraints is
defined below:

struct out_var {
    char name[50];            /* character array holding the character
                                 string of the output variable */
    struct out_var *ptnxt;    /* pointer to next output variable
                                 having the same hash code */
    int position;             /* most recent position of the output
                                 variable in the code */
};

Using the above implementation, we now describe the detection procedure to
illustrate how a graph model for LU decomposition is obtained. Suppose we are
at the i-th operation in the code, which we will call the current operation.
The input variables of the current operation are hashed one at a time, and the
hash codes of the output variables of the previous (i - 1) operations are
searched. If the hash code of the current input variable is equal to one of the
hash codes of the previous output variables, then that particular linked list is
traversed and the character string of the input variable is compared with the
character strings of all the output variables in that linked list. If
there is a match, there exists a precedence relationship between the current
operation i and the previous operation indicated by the position field in the
structure of that output variable. This position field gives the latest position
at which this output variable appears in the code. If there is no match, a
collision has occurred and this input variable imposes no precedence
constraint; if none of the input variables has a match, the current operation
can be initiated immediately.

After all the input variables have been tested, the output variable of the
current operation is hashed. Again the hash code is used to search through the
previous (i - 1) output variables. If it is equal to one of the hash codes of the
previous output variables, we traverse that linked list and compare the
current output variable with the output variables in that linked list. If there
is a match, we then update the latest position of this output variable. If there
is no match, a collision has occurred and the current output variable is added
to the end of the linked list. If the hash code of the current output variable is
not equal to any of the hash codes of the previous output variables to start
with, we create a linked list for this current output variable and add it to the
pool of the previous (i - 1) output variables to be searched in the next stage of
detection.
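
The detection procedure described above can be sketched in C as follows. The
sketch reuses the out_var structure but makes several assumptions for
illustration: the struct operation record, the table size NBUCKETS, the hash
function hash_name and the output array pred are all hypothetical, and a new
output variable is linked at the head of its chain rather than at the end,
which does not change the result of a lookup.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64     /* illustrative table size (assumption) */
#define MAX_INPUTS 3    /* an update operation has at most three inputs */

struct out_var {
    char name[50];           /* character string of the output variable */
    struct out_var *ptnxt;   /* next output variable with the same hash code */
    int position;            /* most recent position in the code */
};

struct operation {           /* hypothetical record for one line of code */
    const char *output;
    const char *inputs[MAX_INPUTS];
    int n_inputs;
};

static unsigned hash_name(const char *s)   /* illustrative hash function */
{
    unsigned h = 0;
    while (*s)
        h = 31u * h + (unsigned char)*s++;
    return h % NBUCKETS;
}

static struct out_var *lookup(struct out_var *tab[NBUCKETS], const char *name)
{
    for (struct out_var *p = tab[hash_name(name)]; p != NULL; p = p->ptnxt)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* For operations numbered 1..m, store in pred[i][k] the position of the
 * operation that most recently wrote the k-th input of operation i
 * (0 if there is none). pred must have at least m + 1 rows.
 * Returns 0 on success, -1 if memory allocation fails. */
int detect_precedence(const struct operation *ops, int m,
                      int pred[][MAX_INPUTS])
{
    struct out_var *tab[NBUCKETS] = {0};
    int status = 0;
    for (int i = 1; i <= m; i++) {
        const struct operation *op = &ops[i - 1];
        for (int k = 0; k < MAX_INPUTS; k++) {       /* test every input */
            struct out_var *v =
                k < op->n_inputs ? lookup(tab, op->inputs[k]) : NULL;
            pred[i][k] = v ? v->position : 0;
        }
        struct out_var *out = lookup(tab, op->output);
        if (out == NULL) {                /* variable written for the first time */
            out = calloc(1, sizeof *out);
            if (out == NULL) {
                status = -1;
                break;
            }
            strncpy(out->name, op->output, sizeof out->name - 1);
            unsigned h = hash_name(op->output);
            out->ptnxt = tab[h];
            tab[h] = out;
        }
        out->position = i;                /* record the latest position */
    }
    for (int h = 0; h < NBUCKETS; h++) {  /* release the table */
        while (tab[h] != NULL) {
            struct out_var *next = tab[h]->ptnxt;
            free(tab[h]);
            tab[h] = next;
        }
    }
    return status;
}
```

Each nonzero entry pred[i][k] corresponds to one edge of the directed graph,
from the operation at that position to operation i. Because the position field
is overwritten each time a variable is written again, a later operation is
linked to the most recent writer of its input rather than to an earlier one.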

The above procedure is repeated for i = 1 through i = M, where M is the total
number of operations in the code. To illustrate the procedure for detecting the
precedence constraints, let us go through one step of it using the generated
code shown in Figure 2.1. Each operation is identified by a number signifying
its position in the code. For example, a2_1 = a2_1 / a1_1 is identified as the
first operation, a5_1 = a5_1 / a1_1 is identified as the second operation, and
so on. Suppose the current operation is the 13th (i = 13). This operation is
a5_4 = a5_4 / a4_4. Its input variables a5_4 and a4_4 are hashed. For
simplicity, let their hash codes be 32 and 27 respectively (i.e. f(a5_4) = 32
and f(a4_4) = 27). Then a search is done on the hash codes of the previous
twelve output variables (since there are twelve operations before the current
one) to see whether any of them match the hash codes of the current input
variables. In this example, two matches are found. One match corresponds to the
input variable a5_4, which is the output variable of the fifth operation in the
code (this output variable has 32 as its hash code). The other match
corresponds to the input variable a4_4, which is the output variable of the
eighth operation (this output variable has 27 as its hash code). Hence a
precedence constraint exists between the 5th operation and the 13th operation,
and also between the 8th operation and the 13th operation. Then the output
variable a5_4 of the current operation is hashed. This hash code is compared
with the hash codes of the output variables of the previous twelve operations.
A match is found because the current output variable a5_4 also appears as the
output variable of the fifth operation in the code. In this case, the position
field in the data structure of this output variable is updated from its
original value of 5 to 13. This update is necessary because, in later steps of
detection, if an input variable of an operation happens to be a5_4, then the
13th operation (which is the most recent operation in which a5_4 is the output
variable) must be identified as the predecessor, rather than the 5th operation
being taken incorrectly.
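
The detection step can be sketched in C as follows. This is only an
illustrative sketch, not the code used in this work: the names hash_name,
lookup, detect_precedence, struct outvar and struct op, the table size NBUCKETS
and the hash function itself are all assumptions made for the example. It also
indexes an array of linked lists directly by hash code, rather than searching a
pool of previous hash codes, and it adds new variables at the head of a list
rather than at the end.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64   /* hypothetical table size */
#define MAXIN 3       /* at most three input variables per operation */

struct outvar {
    const char *name;      /* output variable, e.g. "a5_4" */
    int position;          /* latest operation that writes it */
    struct outvar *next;   /* next variable with the same hash code */
};

struct op {
    const char *out;         /* output variable */
    const char *in[MAXIN];   /* input variables; unused slots are NULL */
};

/* Hypothetical hash function f(): maps a name to a list index. */
static unsigned hash_name(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = 31u * h + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Walk one linked list; a miss here after a hash hit is a collision. */
static struct outvar *lookup(struct outvar *list, const char *name)
{
    for (; list != NULL; list = list->next)
        if (strcmp(list->name, name) == 0)
            return list;
    return NULL;
}

/* For operation i (1-based), pred[i-1][k] receives the position of the
 * operation that produces input k, or 0 if there is no predecessor.
 * Returns 0 on success, -1 if memory runs out. */
int detect_precedence(const struct op *ops, int m, int pred[][MAXIN])
{
    struct outvar *table[NBUCKETS] = { NULL };
    int i, k, status = 0;

    for (i = 1; i <= m && status == 0; i++) {
        const struct op *o = &ops[i - 1];
        struct outvar *v;

        for (k = 0; k < MAXIN; k++) {        /* test the input variables */
            v = o->in[k] ? lookup(table[hash_name(o->in[k])], o->in[k])
                         : NULL;
            pred[i - 1][k] = v ? v->position : 0;
        }
        v = lookup(table[hash_name(o->out)], o->out);
        if (v != NULL) {
            v->position = i;                 /* update the latest position */
        } else if ((v = malloc(sizeof *v)) != NULL) {
            unsigned h = hash_name(o->out);
            v->name = o->out;
            v->position = i;
            v->next = table[h];              /* add to the linked list */
            table[h] = v;
        } else {
            status = -1;
        }
    }
    for (k = 0; k < NBUCKETS; k++)           /* release the table */
        while (table[k] != NULL) {
            struct outvar *dead = table[k];
            table[k] = dead->next;
            free(dead);
        }
    return status;
}
```

Because the inputs of an operation are tested before its output is recorded, an
operation such as a5_4 = a5_4 / a4_4 sees the earlier writer of a5_4 as its
predecessor and only then becomes the latest writer itself.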

An example of the graph model for decomposition is shown in Figure 2.3.
This graph results from the example in section 2.3.2 by running the generated
code through the precedence-constraint detection process. Each node in
the graph represents either an update operation or a divide operation. The
number associated with each node represents its position in the code. That is,
node i represents the i-th operation from the beginning of the code.

2.4.4.  Data  Structure  for  LU  Task  Graph 

In order to manipulate the LU task graph on a computer, a convenient way
of representing the precedence relationships among the nodes is necessary.
There are many ways of representing a directed graph in a computer. Here we
choose a matrix representation called the connectivity matrix representation.
Each element C_ij of the connectivity matrix C is defined as follows:

    C_{ij} = \begin{cases} 1 & \text{if node } i \text{ is the immediate predecessor of node } j, \\ \text{NULL} & \text{otherwise.} \end{cases}

Note that in the LU task graph, there are at most three immediate predecessors
for a node because there are at most three input variables for an update
operation and only two input variables for a divide operation. Hence the
connectivity matrix is extremely sparse: there are no more than three 1's in
any column, since column j holds the immediate predecessors of node j.
A natural choice for the data structure of the connectivity matrix is the linked
list. Since the manipulations on the LU task graph involve deleting rows
and columns of the connectivity matrix, an efficient implementation needs easy
access to every nonzero element in the row and column being deleted.
Hence a symmetric doubly linked list is employed as the
data structure for the connectivity matrix. Each nonzero element in the matrix
is a structure in C defined as below:
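    struct element {
        int irow, jcol;              /* row and column indices */
        struct element *pt_top;      /* nonzero element above */
        struct element *pt_right;    /* nonzero element to the right */
        struct element *pt_down;     /* nonzero element below */
        struct element *pt_left;     /* nonzero element to the left */
    };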




This structure has six fields. Two of these contain the row index (irow) and
column index (jcol). The other four are pointers to the nearest nonzero element
to its right (pt_right), left (pt_left), above (pt_top) and below (pt_down). With
this data structure, we can reach any nonzero element in the row and
column being deleted very conveniently. A schematic diagram of this symmetric
doubly linked list data structure is shown in Figure 2.4.
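
As an illustration of why the four pointers make deletion cheap, the sketch
below unlinks one nonzero element, and then a whole row, from such a structure.
It is a minimal sketch under stated assumptions: the function names
unlink_element and delete_row are hypothetical, a NULL pointer is assumed to
mark the end of a list in each direction, and any separate row or column head
pointers are assumed to be maintained by the caller.

```c
#include <stddef.h>

struct element {
    int irow, jcol;
    struct element *pt_top;
    struct element *pt_right;
    struct element *pt_down;
    struct element *pt_left;
};

/* Detach e from its row list and its column list in constant time.
 * Assumes a NULL pointer marks the end of a list in each direction. */
void unlink_element(struct element *e)
{
    if (e->pt_left  != NULL) e->pt_left->pt_right = e->pt_right;
    if (e->pt_right != NULL) e->pt_right->pt_left = e->pt_left;
    if (e->pt_top   != NULL) e->pt_top->pt_down   = e->pt_down;
    if (e->pt_down  != NULL) e->pt_down->pt_top   = e->pt_top;
    e->pt_left = e->pt_right = e->pt_top = e->pt_down = NULL;
}

/* Delete a whole row: starting from its leftmost element, unlink every
 * element of the row.  Returns the number of elements removed. */
int delete_row(struct element *first)
{
    int count = 0;
    while (first != NULL) {
        struct element *next = first->pt_right;  /* save before unlinking */
        unlink_element(first);
        first = next;
        count++;
    }
    return count;
}
```

Deleting a column is symmetric: walk pt_down instead of pt_right. No search of
the matrix is needed in either case, which is the point of keeping links in all
four directions.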

2.5.  Processor  Scheduling  Techniques 

Processor  scheduling  implies  that  tasks  (i.e.  code  segments  in  a  program  or 
nodes  in  the  LU  task  graph)  are  to  be  assigned  to  a  particular  processor  for  exe¬ 
cution  in  a  particular  order.  At  a  certain  time  step,  there  may  be  more  than  one 
task  ready  for  execution.  Hence  it  is  necessary  to  have  a  representation  which 
conveniently  represents  the  relationship  among  these  tasks.  A  directed  graph  or 
precedence  graph  representation  such  as  the  LU  task  graph  is  probably  the 
most  popular  representation  in  the  scheduling  literature.  The  nodes  in  the  graph 
represent  the  independent  operations  which  are  related  to  each  other  in  time. 
The  three  assumptions  cited  in  the  theory  of  deterministic  scheduling  are  : 

(1)  The  directed  graph  is  acyclic.  That  is  there  are  no  loops  or  cycles  in  the 
graph.  The  presence  of  a  loop  makes  a  scheduling  before  execution  time 
impossible  since  the  conditional  which  controls  the  number  of  iterations 
cannot  be  known  until  execution  time. 

(2)  The  execution  time  of  each  node  is  known  in  advance.  That  is  we  are  not 
dealing  with  stochastic  scheduling  in  which  the  execution  times  are  random 
variables. 

(3) Interruption before task completion is not allowed. This is called the
nonpreemptive scheduling technique. In some cases, preemptive
techniques (in which interruption of a task is allowed) can generate better
schedules than nonpreemptive techniques because the processors' computing
resources are allocated more efficiently. However, if preemption occurs
frequently, the overhead of system interrupt processing and the additional
memory required to store the state of the interrupted task may outweigh the
benefits of preemptive scheduling.

2.5.1. Performance Criteria of a Schedule and Efficiency of an Algorithm

To  evaluate  the  effectiveness  of  the  schedules,  the  most  often  quoted  meas¬ 
ures  are: 

(1)  Minimize  the  completion  time. 

(2)  Minimize  the  number  of  processors  needed. 

Presently  the  quest  for  computing  speed  seems  of  greater  concern  than  the 
cost  of  hardware.  Hence  the  first  criterion  is  far  more  important  than  the 
second  one.  Most  of  the  effort  is  directed  towards  minimizing  the  completion 
time  of  a  given  job. 

The running time of an algorithm to find a processor schedule is also very
important. For the purpose of comparison, we call an algorithm efficient if it
requires computation time which is bounded by some polynomial in the size of
its input. An inefficient algorithm is one which requires an enumeration
of all possible solutions before the optimal schedule can be chosen. The running
times of such algorithms are exponential in the size of their input. Of the
scheduling problems stated in their full generality, very few have efficient
algorithms for obtaining the optimal schedules. This family of intractable
problems is known as the NP-complete problems.

There are various scheduling algorithms for different scheduling
environments. Under some very restricted constraints, optimal processor schedules
can be obtained. An example used here is the level scheduling technique by
Hu [17]. The scheduling environment in Hu's algorithm is that the directed
graph is a rooted tree and the execution times of the nodes of the
tree-structured graph are equal. The details of Hu's level scheduling algorithm
will be discussed in the next chapter.

Under less restricted environments, such as arbitrary graph
structures, two or more processors, and unequal task execution times, a feasible
approach is to develop heuristic scheduling algorithms with polynomially bounded
running time which give suboptimal schedules. Five heuristic scheduling
algorithms are studied by Adam et al. [18]. Two of the heuristics are based on the
concept of the level of a node, one assigns tasks randomly to processors, and
the other two are based on the concept of co-level.

In the next three chapters, the level concept of a node is used extensively
in the heuristic scheduling algorithm when there is communication delay in the
switching network. Hence it is appropriate to define it here. The level of a
node is the sum of the execution times of all the nodes on the longest
path from the node to the terminal node. The co-level of a node is measured
in the same manner as the level, except that the length of the path is computed
from the entry node rather than from the terminal node. Extensive simulation
results have shown that the heuristics having priority assigned according to
level (i.e. the larger the level, the higher the priority) perform better than the
others. These heuristics are referred to as longest-path scheduling. Their
performance has been reported by Adam et al. to be within
4.4% of optimal. This near-optimality has also been observed by other
researchers, such as in [19] and [20]. It demonstrates that longest-path scheduling
is indeed an excellent candidate for obtaining processor schedules under the
above stated environments.
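
The two definitions can be made concrete with a short C sketch. It is offered
as an illustration only: the function names compute_levels and compute_colevels
and the edge-list representation (arrays from[] and to[]) are assumptions made
here, and the nodes are assumed to be numbered so that every edge runs from a
lower-numbered node to a higher-numbered one, as is the case when nodes are
numbered by position in the generated code.

```c
/* Nodes are numbered 0..n-1 so that every edge goes from a lower to a
 * higher number.  Edge e runs from from[e] to to[e]; t[v] is the
 * execution time of node v. */

/* level[v] = t[v] + largest level among the immediate successors of v. */
void compute_levels(int n, int ne, const int *from, const int *to,
                    const int *t, int *level)
{
    int v, e;
    for (v = n - 1; v >= 0; v--) {       /* successors are done first */
        int longest = 0;
        for (e = 0; e < ne; e++)
            if (from[e] == v && level[to[e]] > longest)
                longest = level[to[e]];
        level[v] = t[v] + longest;
    }
}

/* colevel[v] = t[v] + largest co-level among the immediate predecessors. */
void compute_colevels(int n, int ne, const int *from, const int *to,
                      const int *t, int *colevel)
{
    int v, e;
    for (v = 0; v < n; v++) {            /* predecessors are done first */
        int longest = 0;
        for (e = 0; e < ne; e++)
            if (to[e] == v && colevel[from[e]] > longest)
                longest = colevel[from[e]];
        colevel[v] = t[v] + longest;
    }
}
```

The scan over all edges for every node keeps the sketch short; with adjacency
lists the same recurrences run in time proportional to the number of nodes plus
edges.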

There are other scheduling algorithms of interest. Ramamoorthy, Chandy
and Gonzalez [21] developed algorithms to determine the minimum number of
processors required to process an arbitrary graph in minimum time and to
determine the minimum time to process a graph given the number of
processors.

2.6.  Extension  to  Other  Algorithms 

The idea of obtaining the graph model by exploiting the maximum
parallelism in the algorithms can be extended to many other areas such as digital
signal processing. In this section, three parallel algorithms are presented.
These three algorithms are the inner product of two vectors, the linear
convolution of two sequences and the FFT. For each algorithm, the graph model or data
dependency graph is illustrated to show the communication pattern for all the
operations.

2.6.1.  The  Inner  Product 

Consider the problem of computing the inner product of two vectors
a = (a_1, a_2, ..., a_N) and b = (b_1, b_2, ..., b_N). This can be performed in
a program loop as follows:

    for ( i = 1; i <= N; i++ )
        x = x + a_i * b_i ;

where x is initially set to zero. The final value of x will be the inner product of a
and b. We can expand the expression inside the loop by substituting the right
hand side into itself as follows:

    x = a_1 * b_1 ;
    x = a_1 * b_1 + a_2 * b_2 ;
    x = a_1 * b_1 + a_2 * b_2 + a_3 * b_3 ;
    ...
    x = a_1 * b_1 + a_2 * b_2 + ... + a_N * b_N ;

The last step of the above N iterations reveals that considerable parallelism
exists in computing the inner product of two given vectors. For the sake of
simplicity in explanation, let us assume N = 4; the code generated is shown
below:

    x_1 = a_1 * b_1 ;
    x_2 = a_2 * b_2 ;
    x_3 = a_3 * b_3 ;
    x_4 = a_4 * b_4 ;
    x_5 = x_1 + x_2 ;
    x_6 = x_3 + x_4 ;
    x_7 = x_5 + x_6 ;

It is clear from the above seven steps that the first four steps can be done
simultaneously. Then the 5th and the 6th can be performed concurrently, followed
by the 7th. The computational graph, or data dependency graph, for this particular
example is shown in Figure 2.5. In general, if the two given vectors are of size N,
the inner product computation can be mapped into a tree of log_2 N levels. The
whole computation can be done in O(log_2 N) time steps by using O(N)
processors. By using a single processor, 2N-1 time steps are required, where one time
step is either a multiplication or an addition.
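
The tree-structured evaluation can be sketched sequentially in C as below. The
sketch only mimics the order of the parallel computation and counts the time
steps that would be needed if every operation within a stage ran on its own
processor; the function name inner_product_tree and its interface are
assumptions made for illustration.

```c
#include <stdlib.h>

/* Returns the inner product of a and b (length n >= 1) and stores in
 * *steps the number of parallel time steps the tree would take with
 * enough processors.  Returns 0.0 with *steps = 0 if n < 1 or memory
 * cannot be allocated. */
double inner_product_tree(const double *a, const double *b, int n,
                          int *steps)
{
    double *x, result;
    int i, width;

    *steps = 0;
    if (n < 1 || (x = malloc((size_t)n * sizeof *x)) == NULL)
        return 0.0;

    for (i = 0; i < n; i++)          /* stage 1: n independent products */
        x[i] = a[i] * b[i];
    *steps = 1;

    /* Each pass halves the number of partial sums; the additions within
     * a pass are independent of one another. */
    for (width = n; width > 1; width = (width + 1) / 2) {
        for (i = 0; i < width / 2; i++)
            x[i] = x[2 * i] + x[2 * i + 1];
        if (width % 2)               /* odd element passes through */
            x[width / 2] = x[width - 1];
        (*steps)++;
    }
    result = x[0];
    free(x);
    return result;
}
```

For N = 4 this performs the four products in one step and the three additions
in two further steps, matching the seven operations listed above.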

2.6.2. Linear Convolution

In many applications of digital signal processing, we are interested in
implementing a linear convolution of two sequences. An example is the filtering
of a sequence such as a speech waveform or a radar signal. Let us consider the
two N-point sequences h_n and x_n and let y_n denote their linear convolution:

    y_n = \sum_{m=0}^{n} x_m h_{n-m}                                  (2.12)

The sequence h_n can be treated as the given unit-sample response of a linear
time-invariant system, x_n is a sequence of sample values of a signal which
we want to filter, and y_n is the output sequence.
If (2.12) is expanded, the following output sequence y_n is obtained:

    y_0 = x_0 h_0 ;
    y_1 = x_0 h_1 + x_1 h_0 ;
    y_2 = x_0 h_2 + x_1 h_1 + x_2 h_0 ;
    y_3 = x_0 h_3 + x_1 h_2 + x_2 h_1 + x_3 h_0 ;


Given all the h_k, we can see immediately that the multiplications of x_0 with
the h_k can be done in parallel once we know x_0. Similarly, once we know x_1,
the multiplications of x_1 with the h_k can be performed simultaneously, and so
on. The data dependency graph for the case of N = 4 is shown in Figure 2.6. In
general, if the two given sequences are of length N, their linear convolution
can be computed in O(2N-1) time steps using O(N) processors. By using a single
processor, 2 \sum_{k=1}^{N-1} (2k-1) + (2N-1) time steps are required. Again,
one time step is either a multiplication or an addition.
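
The single-processor operation count can be checked with a direct C
implementation. This is a sketch for illustration, with a hypothetical function
name convolve; it assumes the full linear convolution with 2N-1 output samples
and counts one time step per multiplication or addition, which is how the count
quoted above is read here.

```c
/* Direct linear convolution of two n-point sequences h and x.
 * y must have room for 2n-1 samples.  Returns the number of
 * multiplications and additions performed. */
long convolve(const double *h, const double *x, int n, double *y)
{
    long ops = 0;
    int k, m;

    for (k = 0; k < 2 * n - 1; k++) {
        /* Only indices with 0 <= m < n and 0 <= k-m < n contribute. */
        int lo = (k < n) ? 0 : k - n + 1;
        int hi = (k < n) ? k : n - 1;
        double acc = x[lo] * h[k - lo];      /* first product */
        ops++;
        for (m = lo + 1; m <= hi; m++) {
            acc += x[m] * h[k - m];          /* one multiply, one add */
            ops += 2;
        }
        y[k] = acc;
    }
    return ops;
}
```

For N = 4 the outputs need 1, 3, 5, 7, 5, 3 and 1 operations respectively, a
total of 25, which agrees with 2(1 + 3 + 5) + 7.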

2.6.3.  Fast  Fourier  Transform 

There are many situations, such as in speech recognition and in radar
systems, where digital signal processing involves the analysis of the spectrum of the
incoming signal. The discrete Fourier transform (DFT) is the central
computation in these spectral analysis problems. The set of algorithms known as the fast
Fourier transform (FFT) are fast implementations of the DFT. These FFT
algorithms can sometimes reduce the computation time by a factor of 100 or more
compared with the direct evaluation of the DFT.

The DFT of a finite duration sequence x(n), 0 <= n <= N-1, is defined as
follows:

    X(k) = \sum_{n=0}^{N-1} x(n) W^{nk},    k = 0, 1, ..., N-1         (2.13)

where W = e^{-j(2\pi/N)}. The idea behind the FFT is to break the original N-point
sequence into two (N/2)-point sequences, which in turn can be further divided
into four (N/4)-point sequences, etc. This process can be iterated until we have
(N/2) 2-point sequences. This is the radix-2 FFT. The original N-point DFT is
obtained by combining these 2-point DFTs with the appropriate complex scaling
factors. A detailed discussion of the FFT can be found in [22].
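
The first splitting step can be written out explicitly. The notation below is
introduced only for illustration: W_N denotes the W of (2.13) for an N-point
transform, N is assumed even, and E(k) and O(k) are names chosen here for the
(N/2)-point DFTs of the even-indexed and odd-indexed samples.

```latex
\begin{aligned}
X(k) &= \sum_{r=0}^{N/2-1} x(2r)\, W_N^{2rk}
        + W_N^{k} \sum_{r=0}^{N/2-1} x(2r+1)\, W_N^{2rk} \\
     &= E(k) + W_N^{k}\, O(k), \\
X(k + N/2) &= E(k) - W_N^{k}\, O(k), \qquad k = 0, 1, \ldots, N/2 - 1 .
\end{aligned}
```

The second line uses W_N^2 = W_{N/2}, and the third uses W_N^{N/2} = -1. Since
E(k) and O(k) do not depend on each other, they can be computed on different
processors, and the same splitting applies again to each of them.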

The advantage of implementing the FFT on a multiprocessor system lies in the
fact that the DFT of the original sequence can be decomposed into many stages
of DFTs of smaller sequences. By performing these DFTs of smaller sequences in
parallel using many processors, a speedup of O(N) can be expected. Figure 2.7
shows the data dependency graph for an eight-point DFT constructed from four
two-point DFTs.

2.7.  Conclusion 

In this chapter, the Doolittle algorithm for LU decomposition is described. A
sequence of operations representing the LU decomposition process is generated
by a code generator. Based on these operations, the precedence constraints
among them are detected and the corresponding LU task graph is
obtained. Different scheduling algorithms are discussed, and Hu's level
scheduling algorithm seems attractive for obtaining a schedule for the LU task graph. In
the next chapter, Hu's level scheduling algorithm is applied to several
coefficient matrices from some benchmark circuits in SPICE [1]. The idea of
generating a graph model for computation to exploit the maximum parallelism
has been extended to the inner product computation, linear convolution and
FFT.  This  demonstrates  the  feasibility  of  applying  this  approach  to  many  algo¬

    #include <stdio.h>
    #include "decl.h"
    CODE()
    {
        int td1, td2;

        td1 = ugetclk();
        a2_1 = a2_1 / a1_1;
        a5_1 = a5_1 / a1_1;
        a2_4 = -a2_1 * a1_4;
        a2_6 = -a2_1 * a1_6;
        a5_4 = -a5_1 * a1_4;
        a5_6 = -a5_1 * a1_6;
        a4_2 = a4_2 / a2_2;
        a4_4 = a4_4 - a4_2 * a2_4;
        a4_5 = a4_5 - a4_2 * a2_5;
        a4_6 = -a4_2 * a2_6;
        a6_3 = a6_3 / a3_3;
        a6_6 = a6_6 - a6_3 * a3_6;
        a5_4 = a5_4 / a4_4;
        a5_5 = a5_5 - a5_4 * a4_5;
        a5_6 = a5_6 - a5_4 * a4_6;
        a6_5 = a6_5 / a5_5;
        a6_6 = a6_6 - a6_5 * a5_6;
        td2 = ugetclk();

        printf("Result For Single Processor:\n");
        printf("Time to Run The Code Is %d Microseconds.\n",
               td2-td1-2 /* remainder of this expression is illegible in the source */);
    }

Figure 2.1 A Simple Example of the Code Generated

CHAPTER 3

SCHEDULING ALGORITHMS
WITHOUT COMMUNICATION DELAY CONSIDERATION


3.1.  Introduction 

In chapter two, a detailed procedure for obtaining the graph model
representing the LU decomposition of a matrix is described. Based on this LU
graph model, existing scheduling techniques can be used to assign the nodes,
which represent either update or divide operations, to processors for concurrent
execution. A brief survey of processor scheduling techniques was given
in the previous chapter.

In this chapter, a discussion of the list schedules obtained from a class of
list-scheduling algorithms will be given. A comparison of these list-scheduling
algorithms, based on the studies by Adam [18], is made in order to show the
near-optimality of the longest-path scheduling technique using the concept of
level associated with a node in the task graph. In the latter part of the chapter,
the most often cited scheduling algorithm in deterministic processor scheduling
theory, Hu's level scheduling algorithm [17], is described. The performance
of this scheduling technique when applied to several circuit matrices obtained
from some benchmark circuits in SPICE [1] is presented. The performance is
evaluated in two ways. First, it is measured when there is no delay between
communicating processors, and second, it is measured when there is a constant
delay between processors exchanging data. In the next few sections, the precise
meaning of performance in the context of multiprocessing and the
communication delay in the multiprocessor system will be discussed. At the end of the
chapter, simulation results are presented with a discussion of the impact of the
delay on the speedup performance of Hu's level scheduling algorithm.

3.2. List Schedules

In its simplest terms, scheduling is the problem of determining which task
(node) a given processor should execute at each timestep. List schedules are
the most common schedules and are obtained from the list-scheduling
algorithms. These schedules are a class of implementable schedules in which each
task is assigned a numerical value, its priority, and the tasks are
arranged in a list in order of decreasing priority. When a
processor is available, the list is scanned in order of decreasing priority
for an executable task. If there is more than one executable task with the
same priority, one task is chosen at random. The way of assigning priorities to
the nodes is different for different scheduling algorithms, resulting in different
schedules. In this section, a few list-scheduling algorithms are discussed, and in
the next section, Hu's level scheduling algorithm (which can be considered a
list-scheduling algorithm) is discussed.

These list-scheduling algorithms have been studied extensively in [18]. They
are based on the notions of level and co-level associated with a node. These two
terms have been defined in chapter two. The scheduling environment is
unrestricted in the sense that the precedence constraints between tasks and the
execution times of the tasks are arbitrary. These list-scheduling algorithms are:

(1)  Highest  Levels  First  With  Estimated  Times 

The  list  schedules  generated  are  obtained  from  a  list  in  which  all  tasks  are 
assigned  with  priorities  according  to  their  levels.  The  larger  the  level,  the 
higher  the  priority.  This  is  basically  Hu’s  level  scheduling  algorithm. 

(2) Highest Levels First With No Estimated Times

The priority of each task is also its level. However, the level is computed
under the assumption that all the tasks have the same execution time. This
is useful when no estimates for the task execution times are available. For
example, it may be possible to partition a program into modules and yet
have no estimates for the execution time of each module because of
conditional branches, etc. In this case, this list-scheduling algorithm may
provide satisfactory schedules.

(3)  Random 

The  tasks  are  assigned  priorities  randomly. 

(4)  Smallest  Co-levels  First  With  Estimated  Times 

The  priority  of  a  task  is  the  negative  of  its  co-level.  In  other  words,  the 
smaller  the  co-level,  the  higher  the  priority. 

(5)  Smallest  Co-levels  First  With  No  Estimated  Times 

This  is  the  same  as  (4)  above  except  the  co-levels  are  computed  under  the 
assumption  that  all  tasks  have  the  same  execution  time. 

Extensive testing has been done on the performance (measured by the
completion time of the task graph) on graphs generated randomly. Results have
shown that the "Highest Levels First With Estimated Times" scheduling
algorithm performs, most of the time, better than the other four list-scheduling
algorithms (see [18]). This particular scheduling algorithm, which is essentially
Hu's level scheduling algorithm, deserves more examination, and it is discussed
in more detail in the next section.

3.3.  Hu's  Level  Scheduling  Algorithm 

Hu's level scheduling algorithm is perhaps the most referenced
algorithm in the scheduling literature. This algorithm is simple and efficient. The
running time is O(N), where N is the number of nodes in the task graph. It is
proved in [17] that this algorithm generates optimal schedules for tree-structured
precedence graphs in which all the nodes have the same processing time. It can
be applied to an arbitrarily structured graph, and satisfactory schedules are
obtained most of the time. In this section, this algorithm is applied to graphs
obtained from the LU decomposition of circuit matrices, and the performance of
this scheduling algorithm is presented.

The  LU  task  graphs  obtained  from  the  circuit  matrices  generally  have  little 
structure.  To  be  more  specific,  the  precedence  constraints  governing  the  order 
in  which  nodes  are  processed  in  the  LU  decomposition  can  be  completely  arbi¬ 
trary.  The  only  assumption  we  are  making  is  that  the  execution  time  of  the 
operations  (  either  update  or  divide  )  is  the  same.  In  this  unconstrained  schedul¬ 
ing  environment,  obtaining  an  optimal  schedule  for  processors  seems  intract¬ 
able.  Extensive  studies  as  described  in  the  last  section  have  shown  that  heuris¬ 
tic  scheduling  algorithms  employing  the  concept  of  level  give  near-optimal 
schedules.  Hu's  level  scheduling  algorithm  proves  to  be  perhaps  the  most 
appropriate  method  to  use  in  this  situation. 

The definition of the level of a node is given in the last chapter. It represents
the longest distance from that node to the terminal node, where the distance is
the sum of the execution times of all the nodes along that longest path. An example
of a directed graph with the level labelled for each node is shown in Figure 3.1.
In a certain sense, the level of a node gives the minimum number of time units
required to finish the execution of all the nodes with lower level, assuming there
are enough processors available.

Based on this, it is natural to assign a higher priority for earlier execution
to nodes with higher level. The high-level pseudo-code for Hu's algorithm is given
below:

    Among the nodes ready for assignment to processors, pick the ones
    with the highest level and schedule them to the available processors. If
    there are more nodes with the highest level than there are available
    processors, choose among them arbitrarily.

An example schedule for the graph of Figure 3.1 is given in Figure 3.2.
These figures representing the schedules for the processors are referred to as
Gantt charts.
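
The pseudo-code above can be turned into a small simulation. The following C
sketch is an illustration under stated assumptions, not the simulator used for
the results in this chapter: the function name hu_schedule and the edge-list
arguments are hypothetical, every node is assumed to take one time step, the
levels are assumed to be precomputed, and ties between nodes of equal level are
broken by taking the lowest-numbered node (one possible "arbitrary" choice).

```c
#include <stdlib.h>

/* Simulate Hu's level scheduling on m processors for n unit-time nodes.
 * Edge e runs from from[e] to to[e]; level[v] is the level of node v.
 * start[v] receives the time step in which node v executes.
 * Returns the completion time in time steps, or -1 on allocation
 * failure or if no node is ever ready (e.g. the graph has a cycle). */
int hu_schedule(int n, int ne, const int *from, const int *to,
                const int *level, int m, int *start)
{
    int *npred, *picked;
    int v, e, i, k, done = 0, t = 0;

    npred = calloc((size_t)n, sizeof *npred);   /* unfinished predecessors */
    picked = malloc((size_t)m * sizeof *picked);
    if (npred == NULL || picked == NULL) {
        free(npred);
        free(picked);
        return -1;
    }
    for (e = 0; e < ne; e++)
        npred[to[e]]++;
    for (v = 0; v < n; v++)
        start[v] = -1;                          /* not yet scheduled */

    while (done < n) {
        /* Fill up to m processors with the ready nodes of highest level. */
        for (k = 0; k < m; k++) {
            int best = -1;
            for (v = 0; v < n; v++)
                if (start[v] < 0 && npred[v] == 0 &&
                    (best < 0 || level[v] > level[best]))
                    best = v;
            if (best < 0)
                break;                          /* no more ready nodes */
            start[best] = t;
            picked[k] = best;
        }
        if (k == 0) {                           /* nothing ready: give up */
            t = -1;
            break;
        }
        /* Successors become ready only after this time step finishes. */
        for (i = 0; i < k; i++)
            for (e = 0; e < ne; e++)
                if (from[e] == picked[i])
                    npred[to[e]]--;
        done += k;
        t++;
    }
    free(npred);
    free(picked);
    return t;
}
```

Calling the function with m = 1 gives the single-processor completion time, so
the ratio of that value to the value returned for a given m gives the speedup
for m processors under these assumptions.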

In  the  context  of  multiprocessing,  one  of  the  most  important  performance 
criteria  of  a  schedule  is  the  speedup  ratio  achieved.  It  is  defined  as: 


    \text{speedup ratio} = \frac{\text{completion time using one processor}}{\text{completion time using } m \text{ processors}}

The speedup ratio achieved as a function of the number of processors available
is given in Figure 3.3. This example task graph has 128 nodes. A 45-degree line is
drawn in the figure representing the ideal case where, given m processors, a
speedup ratio of m is obtained. This can also be viewed as the case in which all
the nodes represent independent operations and do not have any precedence
constraints among them. This example reveals that as the number of processors
increases, the speedup ratio increases. When the number of available processors
is small, the speedup ratio is very close to the ideal case. As the number of
processors increases further, the speedup ratio increases more slowly until, at a
certain point, no further gain in speedup ratio is possible. This can be explained by
the fact that when more processors are available than necessary, the additional
processors sit idle and do not perform any computation.
This is also related to the bounds which will be discussed in the following
sections.


3.3.1. Minimum Time Required to Finish the Task Graph

Recall that the level of a node is the sum of the execution times of all the
nodes along the longest path from that node to the terminal node. In other
words, it is the minimum number of time units required to finish the rest of the
task graph from this node. Hence the largest level number in the task graph gives the
minimum time required to complete the task graph. This minimum time is
denoted by T_min. In the following two subsections, the bounds on the number of
processors required to finish the task graph by T_min are derived. The derivation
is based on [21] and [17].

3.3.2.  Maximum  Number  of  Processors  Required 

This bound is important because it allows a more efficient allocation of
computing resources. If the number of processors allocated is larger than this
bound, the additional processors will not reduce the time required to complete
the task graph. Let p(i) denote the number of nodes with level equal to i. Let
Max_P = \max_i p(i). Then if Max_P processors are available, all the nodes with
the same level can be executed at each timestep. Hence the task graph can be
completed in a time equal to T_min.

3.3.3.  Minimum  Number  of  Processors  Required 

Since p(i) denotes the number of nodes with level equal to i, \sum_{i \le x} p(i)
is the total number of nodes with level equal to or less than x. In order to finish
all these nodes in time x, at least INT[ (1/x) \sum_{i \le x} p(i) ] + 1 processors
are required, where INT[x] is the integer obtained by taking the integral part of x.
Let L be the largest level in the task graph. The minimum number of processors
required to finish the execution of all the nodes by T_min is
INT[ \max_{1 \le x \le L} (1/x) \sum_{i \le x} p(i) ] + 1.

3.3.4. Maximum Achievable Speedup Ratio

Let N denote the number of nodes in the task graph and $t_i$ the execution
time for node i. Then it takes $\sum_{i=1}^{N} t_i$ time units to complete the task graph
using one processor. Since it requires a minimum of $T_{\min}$ time units to finish all
the nodes, the maximum speedup ratio that can be achieved is given by

$$\sum_{i=1}^{N} t_i \,/\, T_{\min}.$$
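
The three bounds of Sections 3.3.2 to 3.3.4 can be evaluated directly from the level of each node. The following is a minimal sketch, not code from the report: the function name `schedule_bounds`, its argument layout, and the eight-node example are assumptions made for illustration. It applies the INT[.] + 1 rule exactly as stated in the text.

```python
from collections import Counter


def schedule_bounds(levels, exec_times, t_min):
    """Return (Max_P, minimum processors, maximum speedup) for a task graph.

    levels[k]     -- level of node k (positive integers)
    exec_times[k] -- execution time t_k of node k
    t_min         -- minimum completion time T_min of the task graph
    """
    p = Counter(levels)                  # p[i] = number of nodes with level i
    max_p = max(p.values())              # Section 3.3.2: Max_P
    running, best = 0, 0
    for x in range(1, max(p) + 1):       # x = 1 .. L, L = largest level
        running += p.get(x, 0)           # number of nodes with level <= x
        # Integer division gives INT[(1/x) * sum]; taking the maximum of the
        # integer parts equals the integer part of the maximum.
        best = max(best, running // x)
    min_p = best + 1                     # Section 3.3.3, "+ 1" as in the text
    speedup = sum(exec_times) / t_min    # Section 3.3.4
    return max_p, min_p, speedup


# Hypothetical graph: 8 unit-time nodes spread over 4 levels.
levels = [1, 1, 2, 2, 2, 3, 3, 4]
print(schedule_bounds(levels, [1] * len(levels), t_min=4))   # (3, 3, 2.0)
```

In this example the two processor bounds coincide at three, and the speedup can be at most 8 / 4 = 2.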

3.4.  Impact  of  Communication  Delay  on  Speedup  Performance 

Additional  examples  on  the  speedup  performance  of  Hu’s  level  scheduling 
algorithm are shown from Figure 3.4(a) to Figure 3.4(b). The bounds on the
number  of  processors,  the  number  of  levels  and  the  maximum  achievable 
speedup  ratio  are  shown  on  each  example  task  graph. 

All  these  examples  show  that  indeed  Hu’s  level  scheduling  technique  does 
give  quite  promising  results  when  the  communication  delay  between  processors 
is ignored. However, for a realistic distributed computing system, the issue of
communication overhead cannot be overlooked. It lengthens the time
required to complete the task graph and hence degrades the speedup performance
of a multiprocessing system, especially when the communication delay is
large compared to the node execution time. In the next few sections, the impact
of the data transmission delay on this degradation is discussed.

In  many  applications,  the  main  purpose  of  distributed  computing  is  to 
increase  processing  speed  by  parallel  execution,  reducing  the  completion  time 
for a given task. However, in any distributed processing environment, the original
task is divided into many different modules which are assigned to different
processors for concurrent execution. In many real applications, some of the
modules depend on the results of other modules in order to continue their
execution. This exchange of data reduces the overall processing speed because there
is a finite delay before the necessary message arrives at its destination. This
communication delay not only increases the time needed to complete a given
task, but also prevents the computing resources from being allocated
efficiently: processors sitting idle while waiting for data perform no
useful function. It is desirable for a scheduling algorithm to minimize these
undesirable effects.

None of the schedules described above considers the communication overhead.
Stone [23] has proposed a network flow algorithm for multiprocessor
scheduling with communication delay between modules. He formulated the
scheduling as a commodity flow problem and obtained an assignment by
maximizing the flow through the network. This method is not applicable here because
it assumes that there are no precedence constraints between the modules. With these
constraints, the order in which the nodes are executed is severely restricted, and this
complicates the problem considerably.

Before  the  impact  on  the  speedup  performance  is  presented,  the  com¬ 
ponents  of  the  delay  will  be  described  in  the  following  section. 

3.4.1. Components of Communication Delay

In  identifying  the  components  of  the  communication  delay  in  a  network,  a 
simplifying  assumption  is  made  that  the  data  will  not  go  through  a  series  of 
intermediate  processors  before  it  is  forwarded  to  its  final  destination  processor. 
In  other  words,  a  fully  connected  network  topology  for  the  processors  is 
assumed.  The  fundamental  components  of  delay  are  as  follows: 

(1)  Propagation  Delay 

This is essentially the delay associated with traversing the circuit. It is
defined as the ratio of the circuit length to the signal propagation rate. It is
a function of the physical separation between processors.

(2)  Transmission  Delay 

This is the time required for all the bits of data to be put on or fetched from
the circuit. It is the time associated with the execution of the put( ) and
get( ) functions in the simulator, SIMON, as described in chapter one.

(3)  Overhead  Processing  Delay 

This  is  the  time  needed  by  the  processor  to  identify  which  destination  pro¬ 
cessor  the  data  is  being  forwarded  to.  It  also  includes  the  time  required  to 
store the data to be transmitted in an export FIFO as well as the time required to
fetch the data from an import FIFO.

In  this  dissertation,  the  communication  delay  refers  to  the  propagation 
delay. The second and third components represent the system overhead, which
is ignored in the simulation. The primary concern here is the effect of delay on
the speedup performance. When a processor is waiting for the necessary data
from the other processors in order to continue its computation, it
is basically idle, serving no useful purpose. This is not an efficient allocation of
computational resources. Hence scheduling techniques taking into account the
communication delay have to be developed. This will be dealt with in the next two
chapters.  In  the  remainder  of  this  chapter,  the  effect  of  the  delay  on  the 
speedup  ratio  achieved  is  presented.  Several  circuit  matrices  are  used  to  test 
Hu's level scheduling algorithm with constant delay between communicating
processors. The assumption of constant delay simplifies the scheduling problem
in the following sense. In a realistic multiprocessing environment, the
delay between two processors depends dynamically on factors such as the
traffic pattern in the interconnecting switch, the topology of the interconnect
network, the queuing delay, etc. These are usually very difficult to measure. If all
these factors are taken into account, the result is a completely different
scheduling environment, called a stochastic scheduling problem.

3.4.2.  Determination  of  Completion  Time  of  a  LU  Task  Graph 

In  this  section,  the  determination  of  the  completion  time  of  a  LU  task  graph 
based  on  Hu's  schedule  is  described  when  a  constant  delay  is  assumed  in  the 
switching  network  of  a  multiprocessing  system.  Another  assumption  is  that 
once  the  result  of  an  operation  is  generated,  it  is  forwarded  immediately  to  the 
other  processors  so  that  these  processors  can  continue  their  computation.  Also 
the  correct  order  of  the  arrival  of  data  is  assumed  in  the  receiving  processor  in 
order  to  assure  the  correct  execution  of  the  algorithm. 

The  determination  of  the  completion  time  of  a  LU  task  graph  when  there  is 
a  constant  delay  between  communicating  processors  is  not  as  simple  as  in  the 
case when the delay is ignored. In the latter case, the completion time of a given
graph can be obtained from the Gantt chart representing the schedule of the
processors. For example, if the LU task graph of Figure 3.1 is scheduled to be
executed on three processors, the corresponding Gantt chart is shown in
Figure 3.2(b). The completion time of this schedule is 7 units. In the case where
there is a delay, it is necessary to go through all the operations scheduled to be
executed at each timestep, locating the predecessors of each operation, adding
the magnitude of the delay to the times at which the predecessors are completed
by the processors they are assigned to, and then finding the maximum of these
times. This maximum is the earliest time at which the node can start; adding the
execution time of the node gives the time at which its execution will be completed. A
more detailed treatment will be given in chapter four when scheduling
techniques taking into account the communication delay are discussed.

However, the following simple example illustrates the procedure for
determining the completion time of a schedule when there is a delay. A segment
of a three-processor schedule is shown in Figure 3.5 at timestep equal to 32. At
this timestep, node l is assigned to processor 2. It is necessary to search
through the graph (the connectivity matrix) to find the predecessors of
node l. Suppose they are nodes i, j, and k. Then another search of the schedule
of each processor to locate these predecessors is necessary. Suppose they are
located as shown and the corresponding times at which they are completed are
15, 18 and 17 time units for nodes i, j and k respectively. Assume a delay of 20
units in forwarding a message from the sending processor to the receiving
processor. Since node j and node l reside in the same processor, there is no
need to consider the delay of transmitting the result of node j to node l. If
processors 3 and 1 forward the results of nodes i and k
to processor 2 immediately after executing them, the result of node i will arrive at processor 2 at
time equal to 35 while the result of node k will arrive at time equal to 37. Hence
the time at which processor 2 will finish the execution of node l is 38 (which is
equal to 37, the latest arrival time of the results from the
predecessors, plus 1, the execution time of node l). Then the same procedure is
repeated for node m and node n in the same timestep. After all the nodes at this
timestep are completed, the nodes in the next timestep are processed. This
procedure continues until the last timestep is reached and the completion time
is found.
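
The per-node step of this procedure can be written as a short function. This is a sketch under the assumptions of the example (constant delay, results forwarded immediately, no delay between nodes on the same processor); the function name and the `(completion time, processor)` pair layout are illustrative choices, not taken from the simulator.

```python
def node_completion_time(node_proc, preds, delay, exec_time=1):
    """Completion time of a node placed on processor `node_proc`.

    preds is a list of (completion_time, processor) pairs, one per
    predecessor. A result coming from another processor arrives `delay`
    units after the predecessor completes; a result produced on the same
    processor is available immediately.
    """
    arrivals = [finish if proc == node_proc else finish + delay
                for finish, proc in preds]
    # The node starts when the last result arrives, then runs for exec_time.
    return max(arrivals, default=0) + exec_time


# Example from the text: node l on processor 2; predecessors i, j, k
# complete at 15, 18 and 17 on processors 3, 2 and 1; delay of 20 units.
preds_of_l = [(15, 3), (18, 2), (17, 1)]
print(node_completion_time(2, preds_of_l, delay=20))   # 38
```

The arrival times are 35, 18 and 37, so the node starts at 37 and completes at 38, as in the text.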

Having found the completion time, the speedup ratio is obtained by dividing
the number of nodes in the task graph (which is numerically equal to the
completion time on one processor, assuming one unit of execution time for each
node) by the completion time obtained as described in the last
paragraph.

3.5.  Results  and  Discussion 

The simulation results of the impact of the delay on the speedup performance
are shown from Figure 3.6(a) to Figure 3.6(c). These graphs represent
the LU decomposition process of a given matrix. The number of nodes in the
graph is the number of operations required to carry out the decomposition to
completion.  These  schedules  are  obtained  by  Hu's  level  scheduling  algorithm 
and  different  delays  are  assumed  between  communicating  processors  to  see  the 
effect  on  the  completion  time  of  these  schedules.  As  seen  from  the  simulation 
results,  the  increase  in  delay  decreases  the  speedup  ratio  achieved.  When  it  is 
expected  that  the  communication  overhead  will  degrade  the  speedup  perfor¬ 
mance  unacceptably,  some  other  scheduling  algorithms  with  delay  considera¬ 
tion  should  be  employed. 

With the communication delay taken into account, the scheduling problem
becomes even more complicated. Without the delay, any available node can be
assigned to any available processor without regard to where the predecessors of
this node are located. With the communication delay considered, the
node-to-processor assignments at the previous timesteps affect the assignment at the
present timestep. Also, there seems to be no way of predicting how the
present assignment will affect the assignments at future timesteps.

A tradeoff between parallel execution and reduction in communication
delay also exists in this scheduling problem. In the presence of communication delay,
a scheduler tends to assign a node to the same processor to which its
predecessors have been assigned.
This tendency to schedule nodes on the same processor results in serial
execution of these nodes, and hence does not exploit the possible parallelism of
these nodes in the task graph. Since the main performance criterion here is to
finish the task graph in the shortest possible time, it is nevertheless sometimes beneficial to
execute these nodes serially, especially when a large delay exists in the switching
network.

3.6. Conclusion


In this chapter, a common class of deterministic processor scheduling techniques is
presented. Based on extensive simulation results, Hu's level scheduling
algorithm seems to be the most suitable for scheduling the LU task graph for
concurrent execution when there is no delay between communicating processors.
As seen from the simulation results, the speedup is almost linear
in the number of processors when this number is well below the bound on the
minimum number of processors given in section 3.3.3.

However, the presence of delay overhead degrades this speedup performance
considerably if Hu's level scheduling algorithm is used to obtain the
node-to-processor assignment. Hence scheduling techniques taking into
account the communication delay have to be employed in this case. The main
objective  is  to  obtain  a  schedule  such  that  a  LU  task  graph  can  be  finished  in  the 
shortest  possible  time.  While  a  complete  solution  to  this  problem  appears  to  be 
inherently  intractable,  the  next  two  chapters  attempt  to  solve  this  problem 
through  the  use  of  heuristics. 

Figure 3.2 Schedules Obtained by Hu's Level Scheduling for the Directed
Graph in Figure 3.1 with (a) Two Processors (b) Three Processors
(c) Four Processors

Figure 3.3 Speedup Ratio Achieved Using Hu's Level Scheduling for the
128-Node Task Graph

Different Constant Delays for the 128-Node Task Graph


CHAPTER  4 


HEURISTIC  LOCAL  SCHEDULING  TECHNIQUES 

4.1.  Introduction 

We have seen from the previous chapter that a communication delay that is large
compared to the node execution time degrades the speedup performance
considerably. In order to increase the efficiency of the LU decomposition algorithm
in  a  multiprocessor  system,  scheduling  algorithms  taking  into  account  the  com¬ 
munication  delay  should  be  employed.  In  this  and  in  the  following  chapters, 
scheduling  algorithms  are  developed  to  minimize  the  completion  time  of  a  given 
LU  task  graph  assuming  constant  delay  on  each  edge  of  the  task  graph.  The 
scheduling  techniques  are  divided  into  two  classes.  One  class  is  the  local 
approach  which  is  discussed  in  this  chapter.  The  other  class  is  the  global 
approach  which  will  be  discussed  in  the  next  chapter. 

The  word  local  describes  the  scheduling  techniques  which  minimize  the 
completion  time  of  the  task  graph  at  each  timestep  by  performing  combinatorial 
matchings  of  nodes  ready  for  assignment  to  processors  at  that  timestep.  The 
word  global  refers  to  the  scheduling  technique  which  minimizes  the  completion 
time  by  considering  all  the  nodes  as  a  whole  in  the  task  graph.  In  this  chapter, 
four  heuristic  local  scheduling  techniques  are  discussed.  The  performance  of 
these  heuristics  is  measured  as  the  percentage  of  improvement  over  Hu’s 
scheduling  technique  in  the  presence  of  communication  delay.  The  precise 
definition of performance improvement will be given when we discuss the simulation
results for these local heuristics at the end of the chapter. Before we go
into  the  details  about  each  of  the  local  heuristic  scheduling  techniques,  let  us 
get  a  glimpse  of  the  complexity  and  the  difficulty  of  the  problem  involved. 


4.2.  Complexity  and  Difficulty  of  the  Problem 

Any sequencing algorithm whose complexity is bounded by a polynomial in
the size of the problem is called a polynomial-time algorithm. However, there
exists a class of problems, called NP-complete problems, which are equivalent in
the following sense: if one can find a polynomial-time algorithm to solve one of
these problems, one can solve all the other problems in that class in polynomial time.
Numerous efforts have been spent over the years, unsuccessfully, in the search
for a less than exponential time solution to these problems. There is strong
evidence that these NP-complete problems are indeed inherently intractable. In
the domain of scheduling theory, almost all sequencing problems stated in their
complete generality fall into this category. In particular, it was shown that in
each of the following cases, determining an optimal nonpreemptive schedule is
NP-complete. The proof is given in [24].

(1)  The  execution  time  of  tasks  (  nodes  )  is  arbitrary,  there  are  two  or  more 
processors  available. 

(2)  The  execution  time  of  tasks  (  nodes  )  is  one  time  unit,  the  precedence  rela¬ 
tion  among  tasks  (  nodes  )  is  arbitrary,  there  are  two  or  more  processors 
available. 

The scheduling of nodes of a LU task graph is similar to case (2) above. The
presence of delay between communicating processors further complicates the
determination of the optimal schedule. Given the complexity
and the difficulty of the problem we are facing, a sensible approach is, rather
than finding the optimal schedule, to develop heuristic algorithms with
feasible running times that produce reasonable schedules, in the sense that
these schedules are better than those of Hu's level scheduling algorithm in the
presence of communication delay.


In  the  following  sections,  four  heuristic  algorithms  are  described.  For  the 
purpose  of  comparison,  they  are  named  heuristic  algorithms  D,  E,  F  and  EF. 
Each  of  these  algorithms  will  be  discussed  in  detail.  Some  definitions  and  termi¬ 
nology  pertaining  to  these  heuristics  are  given  in  the  following  sections. 

4.3.  Definitions  and  Terminologies 

A  set  of  ready  nodes  is  a  collection  of  nodes  of  a  given  graph  which  are 
ready  for  assignment  to  available  processors.  A  node  is  in  this  set  if  its  prede¬ 
cessors  have  been  executed.  In  other  words,  its  predecessors  have  been 
assigned  to  processors  in  the  previous  timesteps. 

The  completion  time  of  a  node  is  the  time  at  which  the  processor  to  which 
this  node  is  assigned  finishes  the  execution  of  this  node. 

The  elapsed  time  of  a  processor  is  the  time  at  which  it  completes  the  execu¬ 
tion  of  all  the  nodes  assigned  to  it.  The  elapsed  time  includes  the  transmission 
delay  as  a  result  of  having  the  predecessors  of  the  node  scheduled  to  different 
processors. 

The  partial  completion  time  of  a  task  graph  at  each  timestep  is  the  time 
required  by  all  processors  to  complete  the  execution  of  all  the  nodes  assigned  to 
the  processors  so  far.  This  partial  completion  time  again  includes  the  transmis¬ 
sion  delay  and  it  is  equal  to  the  maximum  of  all  the  elapsed  times  of  the  proces¬ 
sors.  If  at  the  last  timestep,  all  the  nodes  in  the  task  graph  have  been  scheduled, 
then  the  partial  completion  time  at  this  timestep  will  be  the  completion  time  of 
the  task  graph. 

The following simple example schedule at timestep $t_i$ illustrates the
above definitions. Suppose that at timestep $t_i$, $\{ i_1, i_2, i_3 \}$ is a set of ready nodes and
we have three processors $p_1$, $p_2$, $p_3$. The predecessors of node $i_1$ are nodes $j_1$ and
$j_{1a}$. The predecessor of node $i_2$ is node $j_2$ and the predecessor of node $i_3$ is node
$j_3$ (see Figure 4.1(a)). Assume that these predecessors are assigned as shown
in Figure 4.1(b). Suppose the delay is d units and each node has an execution time of
one unit. Suppose further that an assignment is made such that node $i_1$ is
assigned to processor $p_1$, node $i_2$ is assigned to processor $p_2$ and node $i_3$ is
assigned to processor $p_3$. Let $et(p_i)$ denote the elapsed time of processor $p_i$.
Then $et(p_1) = t_i + d + 1$ (delay in transmitting the result from a predecessor in
processor $p_2$ to node $i_1$ in processor $p_1$, plus the node execution time, which is one),
$et(p_2) = t_i + d + 1$ (delay in transmitting the result from node $j_2$ in processor
$p_3$ to node $i_2$ in processor $p_2$, plus the node execution time, which is one), and
$et(p_3) = t_i + 1$ (no delay, because the predecessor of node $i_3$, which is $j_3$, is assigned
to the same processor as node $i_3$). The partial completion time at timestep $t_i$ is
the maximum of $\{ et(p_1), et(p_2), et(p_3) \}$, which is $t_i + d + 1$. Let $ct(i)$ denote
the completion time of node i; then $ct(i_1) = et(p_1)$, $ct(i_2) = et(p_2)$, and
$ct(i_3) = et(p_3)$.
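
The elapsed times in this example can be checked numerically. The sketch below follows the simplification used in the example, namely that every predecessor has completed by timestep $t_i$, so a node pays the delay d once if any predecessor sits on another processor. The function name, the dictionary layout, the numeric values of $t_i$ and d, and the placement of the first predecessor of $i_1$ on $p_1$ are assumptions, since Figure 4.1(b) is not reproduced here.

```python
def elapsed_times(t_i, d, assignment, pred_procs, exec_time=1):
    """Elapsed time et(p) of each processor after the assignment at t_i.

    assignment -- maps each ready node to the processor it is assigned to
    pred_procs -- maps each ready node to the processors of its predecessors
    """
    et = {}
    for node, proc in assignment.items():
        # The delay applies only if some predecessor is on another processor.
        remote = any(q != proc for q in pred_procs[node])
        et[proc] = t_i + (d if remote else 0) + exec_time
    return et


assignment = {"i1": "p1", "i2": "p2", "i3": "p3"}
# Assumed placement: i1 has one predecessor on p1 and one on p2,
# i2's predecessor is on p3, i3's predecessor is on p3.
pred_procs = {"i1": ["p1", "p2"], "i2": ["p3"], "i3": ["p3"]}

et = elapsed_times(t_i=10, d=5, assignment=assignment, pred_procs=pred_procs)
partial_completion_time = max(et.values())
print(et, partial_completion_time)
# {'p1': 16, 'p2': 16, 'p3': 11} 16
```

With $t_i = 10$ and d = 5, the partial completion time is $t_i + d + 1 = 16$.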

In the later discussion of local heuristic algorithms E, F and EF, the
assignment of nodes to processors is viewed as a classical matching problem in a
special graph [26]. Hence some definitions in this respect are necessary, and simple
examples are used to illustrate these definitions.

A bipartite graph is a graph whose nodes can be partitioned into two sets S and
T, so that each edge has one end in S and the other end in T. The edge between
node i and node j is denoted by edge(i, j). An example of a bipartite graph is
shown in Figure 4.2(a).

Let G = (S, T, A) be an undirected bipartite graph. A subset X of A is said to be
a matching if no two edges in X are incident to the same node. For example, in
Figure 4.2(b), the wavy edges are in a matching. That is, X = { edge(1,4),
edge(2,6) }.

With respect to a given matching X, a node i is said to be covered if there is an
edge in X incident to i. If a node is not covered, it is said to be exposed. In
Figure 4.2(b), nodes (1), (2), (4) and (6) are covered while nodes (3) and (5) are
exposed.

For a given matching X, an alternating path is a path whose edges are
alternately in X and not in X. An augmenting path is an alternating path between two
exposed nodes. The size of a matching is measured by the number of edges in
the matching. By discovering an augmenting path, the size of the matching can
be increased by one. For example, in the graph of Figure 4.2(b), the path { (3),
(4), (1), (5) } is an augmenting path with nodes 3 and 5 being the exposed nodes.
This augmenting path is also shown in Figure 4.3(a). The following procedure
explains why it is an augmenting path. The matching X of the bipartite graph
shown in Figure 4.2(b) contains edge(1,4) and edge(2,6). Hence X has a size
equal to two. By excluding edge(1,4) from X and putting edge(3,4) and edge(1,5)
into X as shown in Figure 4.3, the new matching now contains edge(3,4),
edge(1,5) and edge(2,6). This new matching is shown in Figure 4.4. The size of the
new matching is now equal to three. Hence by finding the augmenting path {
(3), (4), (1), (5) }, we are able to increase the size of the matching by one.
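
The augmentation step in this example amounts to taking the symmetric difference of the matching and the edges of the path. The following sketch reproduces the example; the helper names `is_augmenting` and `augment` and the use of `frozenset` to represent undirected edges are illustrative choices.

```python
def path_edges(path):
    """Undirected edges along a path given as a list of nodes."""
    return [frozenset(e) for e in zip(path, path[1:])]


def is_augmenting(matching, path):
    """True if `path` alternates out/in/out of the matching between exposed ends."""
    covered = {v for edge in matching for v in edge}
    ends_exposed = path[0] not in covered and path[-1] not in covered
    alternates = all((edge in matching) == (k % 2 == 1)
                     for k, edge in enumerate(path_edges(path)))
    return ends_exposed and alternates


def augment(matching, path):
    """Swap matched and unmatched edges along the path (symmetric difference)."""
    return matching ^ set(path_edges(path))


X = {frozenset((1, 4)), frozenset((2, 6))}     # matching of Figure 4.2(b)
path = [3, 4, 1, 5]                            # augmenting path of Figure 4.3(a)

print(is_augmenting(X, path))                  # True
new_X = augment(X, path)
print(sorted(tuple(sorted(e)) for e in new_X)) # [(1, 5), (2, 6), (3, 4)]
```

The new matching has three edges, one more than X.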

As mentioned earlier in this chapter, three of the local heuristic schemes
use combinatorial optimization techniques to obtain an assignment of nodes to
processors. Two matching algorithms are employed to obtain such an
assignment. They are the min_max matching algorithm and the weighted matching
algorithm. The detailed description of these two matching algorithms can be found in
standard textbooks such as [25]. For completeness, the corresponding problems are stated below.

Min_max Matching Problem

Given an edge-weighted bipartite graph, find a matching containing a maximum
number of edges for which the maximum weight of the edges in the matching is
minimum. The min_max matching algorithm uses the Augmenting Path Theorem,
which states that a matching X contains a maximum number of edges if and only
if it admits no augmenting paths. The algorithm for finding the matching is
called the Threshold Method for the min_max matching problem and it is described
in [25]. The other matching problem is called the Weighted Matching Problem,
and is stated below.

Weighted  Matching  Problem 

Given  an  edge-weighted  bipartite  graph,  find  a  matching  for  which  the  sum  of 
the  weights  of  the  edges  is  maximum.  The  algorithm  for  obtaining  the  matching 
is also described in [25].
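
The min_max matching problem stated above can be illustrated with a simplified threshold-style search. This is a sketch, not the Threshold Method as given in [25]: it re-solves a maximum-cardinality matching (by repeated augmenting paths) for each candidate threshold, which is easy to follow but not efficient. The function names, the weight-matrix layout (rows are nodes, columns are processors, `None` marks a missing edge) and the example weights are assumptions.

```python
def max_matching(allowed, n_right):
    """Maximum-cardinality bipartite matching by repeated augmenting paths.

    allowed[u] lists the right-hand vertices adjacent to left-hand vertex u.
    Returns (size, match_right), where match_right[v] is the left vertex
    matched to v, or -1 if v is exposed.
    """
    match_right = [-1] * n_right

    def try_augment(u, seen):
        for v in allowed[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    size = sum(try_augment(u, set()) for u in range(len(allowed)))
    return size, match_right


def min_max_matching(weights):
    """Return (bottleneck weight, set of (left, right) pairs)."""
    n_right = len(weights[0])

    def restricted(threshold):
        # Keep only the edges whose weight does not exceed the threshold.
        return [[v for v, w in enumerate(row) if w is not None and w <= threshold]
                for row in weights]

    values = sorted({w for row in weights for w in row if w is not None})
    if not values:
        return None, set()
    target, _ = max_matching(restricted(values[-1]), n_right)
    for threshold in values:           # smallest threshold that still allows
        size, match_right = max_matching(restricted(threshold), n_right)
        if size == target:             # a maximum number of edges
            return threshold, {(u, v) for v, u in enumerate(match_right) if u != -1}


weights = [[5, 6],     # hypothetical weights: row = node, column = processor
           [4, 11]]
print(min_max_matching(weights))
```

For these weights the only maximum matchings are {(0,0), (1,1)} with largest weight 11 and {(0,1), (1,0)} with largest weight 6, so the search stops at threshold 6.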

When these two matching algorithms are applied in these local heuristic
techniques, the node-to-processor scheduling environment with the
consideration of communication delay has to be mapped into a weighted bipartite graph.
The details of this correspondence will be discussed as we describe the heuristics.
With these definitions and terminology in mind, we are ready to describe the
four local heuristic scheduling techniques in the following sections.

4.4.  Heuristic  Algorithm  D 

This is a very intuitive approach which does not involve any of the
combinatorial techniques. Our main objective is to minimize the partial completion time
at each timestep. Given a set of ready nodes, a node with the highest level is
chosen from the set for assignment to processors. Suppose we are given p
processors. The selected node is tentatively assigned to these processors one at a time and the
elapsed time of each processor is found by taking into account the delay when
the predecessors of the chosen node are located in different processors. For this
particular node, we have p such elapsed times. We can view these elapsed times
as a vector $ET = [et_1, \ldots, et_i, \ldots, et_p]^T$, where $et_i$ is the elapsed time of
processor i when this chosen node is assigned to it. Among these p entries in the
vector, choose the minimum, say $et_j$ is the minimum, and assign this node to
processor j. If there is more than one minimum, pick one arbitrarily. Then
another node with the next highest level is chosen from the set of ready nodes.
At this time, we are left with $(p - 1)$ processors to which we can try to assign
this node. So the vector of elapsed times $ET = [et_1, \ldots, et_{j-1}, et_{j+1}, \ldots, et_p]^T$
has only $(p - 1)$ entries. Assign this node to the processor with the minimum
elapsed time and so on. Continue this process until we have finished assigning all
the nodes in the ready set or there are no more processors available at this
timestep. Then find another set of ready nodes. Repeat the above assignment
procedure until all the nodes in the task graph are scheduled. The high level
description of the program is given below:

timestep = 0;

while ( there are unassigned nodes )

{

(1) Find the set of ready nodes at this timestep.

(2) Pick a node with the highest level from this set of ready nodes.

(3) Tentatively assign the chosen node to each of the available processors and
find the elapsed times of these processors.

(4) Among all the elapsed times, assign the chosen node to the processor
which gives the minimum elapsed time. If more than one processor
has the same minimum, choose one arbitrarily.

(5) Reduce the number of available processors by one. Remove the node
from the set of ready nodes.

(6) If ( there are still processors available and the set of ready nodes is
nonempty ) Go back to (2).

Else Increment the timestep.

} /* end of while */
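
The steps above can be sketched in Python as follows. This is an illustrative reading of the high level description, not the original program: the function name, the dictionary-based graph representation, and the rule that a node cannot start before its processor has finished its earlier nodes are assumptions. Ties are broken by taking the first candidate, which is one way of choosing arbitrarily.

```python
def heuristic_d(preds, level, num_procs, delay, exec_time=1):
    """Greedy local scheduling in the spirit of heuristic algorithm D.

    preds -- maps each node to the list of its predecessors
    level -- maps each node to its level
    Returns (node -> processor, node -> completion time, completion time).
    """
    proc_of, ct = {}, {}              # assignment and completion time per node
    et = [0] * num_procs              # elapsed time of each processor

    def finish_on(node, q):
        # Results from other processors arrive `delay` units after completion.
        arrivals = [ct[m] + (0 if proc_of[m] == q else delay)
                    for m in preds[node]]
        return max([et[q]] + arrivals) + exec_time

    while len(proc_of) < len(preds):
        # (1) ready nodes: all predecessors assigned at earlier timesteps
        ready = [n for n in preds
                 if n not in proc_of and all(m in proc_of for m in preds[n])]
        if not ready:
            raise ValueError("task graph contains a cycle")
        ready.sort(key=lambda n: -level[n])        # (2) highest level first
        available = list(range(num_procs))
        for node in ready:
            if not available:                      # (6) no processor left
                break
            # (3), (4): try every available processor, keep the minimum
            q = min(available, key=lambda r: finish_on(node, r))
            ct[node] = et[q] = finish_on(node, q)
            proc_of[node] = q
            available.remove(q)                    # (5)
    return proc_of, ct, max(ct.values(), default=0)


# Hypothetical three-node graph: c depends on a and b.
preds = {"a": [], "b": [], "c": ["a", "b"]}
level = {"a": 2, "b": 2, "c": 1}
print(heuristic_d(preds, level, num_procs=2, delay=5))
# ({'a': 0, 'b': 1, 'c': 0}, {'a': 1, 'b': 1, 'c': 7}, 7)
```

In this small case the greedy choice spreads a and b over both processors, so c must wait for a remote result and completes at time 7, whereas running all three nodes on one processor would complete at time 3.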


The advantage of this heuristic algorithm is that it is simple and its running time is
proportional to the number of nodes in the task graph. A moment's thought
reveals that the above assignment will not always give the shortest partial
completion time at each timestep. The obvious reason is that we do not
consider all the nodes in the ready-node set at the same time. Let us consider a
simple and hypothetical case. Suppose we have two processors available and at
timestep equal to 3, there are only two nodes, nodes (3) and (4), in the set of
ready nodes and they are of the same level. Hence there are only two possible
schedules as shown in Figure 4.5(a) and Figure 4.5(b). The elapsed times of the
processors are as shown. In schedule (I), the elapsed time of $p_1$ is 5 and the
elapsed time of $p_2$ is 11. In schedule (II), the elapsed time of $p_1$ is 4 and the
elapsed time of $p_2$ is 6. If node (3) is selected first, we have schedule (I); on
the other hand, if node (4) is selected first, we have schedule (II). Given the two
possible assignments (I) and (II), it is clear that schedule (II) is better than
schedule (I) in the sense that schedule (II) has a shorter partial completion time
(equal to 6 in this example) than schedule (I) (equal to 11 in this example).
Hence the order in which nodes of the same level are chosen from the set of ready nodes
may affect the final completion time of a task graph. In order to
minimize this effect and to get the shortest partial completion time at each timestep,
all possible node-to-processor combinations have to be considered and, among
all these possible assignments, we choose the one which gives the shortest
partial completion time. However, the time it takes to obtain all these possible
combinations is proportional to the factorial of the number of processors available
as well as the number of nodes ready to be scheduled at that timestep. To be
specific, let p be the number of processors available and let n be the number of
nodes in the set of ready nodes at a certain timestep. There are


It is obvious that this exhaustive method is not a good approach. Therefore
some combinatorial matching algorithms should be employed to minimize the
time required to search for optimal nodes-to-processors assignments. The
min_max matching is well suited for the above purpose. It will find a matching
(between nodes and processors) that gives such an assignment.

However, there is still room for us to optimize in order to get a better
schedule. We will illustrate this point again with an example. There are cases
in which more than one schedule is obtained with the same minimum partial
completion time at a certain timestep. The question is, among these minimum
partial completion time schedules, is one better than the other? Let us look at
the following simple example in which there are two nodes in the set of ready
nodes at timestep equal to 4, and there are only two available processors.
Suppose we have the two possible schedules shown in Figure 4.6. Both schedules
have the same partial completion time (13 time units). The min_max matching
algorithm applied at this timestep will give either schedule (I) or schedule
(II). However, it may be beneficial to have schedule (I) because the sum of the
elapsed times of the two processors p_1 and p_2 is less than the corresponding
sum in schedule (II). In this case, it may be possible to assign more nodes to
processor p_1 after the execution of node (5) in schedule (I) than to processor
p_2 in schedule (II). In other words, since node (5) has a shorter completion
time (5 time units) on p_1 in schedule (I) than its completion time (9 time
units) on p_2 in schedule (II), more nodes can be assigned to p_1 in schedule (I)
after timestep equal to 5. Hence we would like to have a more compact schedule
at each timestep so that more nodes can be assigned at the later timestep, thus
reducing the completion time of the task graph.

These two matching algorithms are incorporated in the next three local
heuristic scheduling algorithms. The detailed mapping of the nodes-to-processors
scheduling environment into a bipartite graph, so that these matching
algorithms can be applied, will be described as we discuss the heuristics in
the next few sections.


4.5. Application of Min_Max and Weighted Matchings to Heuristic Algorithms

In the next three heuristic algorithms E, F and EF, it is necessary to obtain
a weighted bipartite graph at each timestep representing the scheduling
environment. Let us describe how it is done. Suppose there are p processors
available and, at a certain timestep, there are n nodes in the set of ready nodes.
Pick min{p, n} nodes from this set and, for each assignment of one of these nodes
to one of the p processors, find the elapsed time of that processor, taking into
account the communication delay incurred when the predecessors of that node are
assigned to different processors. After this is done, we represent this
two-dimensional data in a matrix of size p by min{p, n}. The row index in the
matrix corresponds to the processor number and the column index corresponds to
the node number. The value of the (i, j)th element, denoted by et_ij, in the matrix
will be the elapsed time of processor i when node j is assigned to it. With the
previous notation of a bipartite graph G = (S, T, A), we can consider the processors
p_1, p_2, ..., p_p as the nodes in the set S, the nodes i_1, i_2, ..., i_n in the set of
ready nodes as the nodes in T, and the elapsed times as the weights on the edges
of the bipartite graph. A simple bipartite graph representing the scheduling
environment is shown in Figure 4.7. There are two processors p_1, p_2 available
and there are three nodes i_1, i_2, i_3 ready for assignment at a certain timestep.
The matrix representation, ET = [et_ij], of the weighted bipartite graph of
Figure 4.7 is shown below:


\[
ET = \begin{bmatrix}
et_{11} & et_{12} & et_{13} \\
et_{21} & et_{22} & et_{23}
\end{bmatrix}
\]

where et_ij is the elapsed time of processor i when node j is assigned to it.
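As a hedged illustration of how the entries of ET might be filled in, the Python sketch below assumes a simple timing model that this section does not spell out: a node may start on a processor once that processor is free and all predecessor results have arrived, and a result coming from a different processor arrives one constant communication delay after the predecessor finishes. The function names, the dictionaries and the sample data are hypothetical and chosen only for this example.

```python
def elapsed_time(proc, node, proc_free, finish, placed_on, preds, exec_time, delay):
    """Elapsed time of `proc` if `node` were assigned to it (assumed model)."""
    start = proc_free[proc]
    for pred in preds[node]:
        # A predecessor placed on another processor adds the communication delay.
        arrival = finish[pred] + (0 if placed_on[pred] == proc else delay)
        start = max(start, arrival)
    return start + exec_time[node]


def build_et(procs, candidates, **state):
    """Rows are processors, columns are candidate ready nodes."""
    return [[elapsed_time(p, j, **state) for j in candidates] for p in procs]


# Hypothetical state: node "a" ran on processor 0 and node "b" on processor 1.
state = dict(
    proc_free={0: 2, 1: 3},
    finish={"a": 2, "b": 3},
    placed_on={"a": 0, "b": 1},
    preds={"x": ["a"], "y": ["a", "b"]},
    exec_time={"x": 1, "y": 1},
    delay=2,
)
et = build_et([0, 1], ["x", "y"], **state)
print(et)  # [[3, 6], [5, 5]]
```

In this made-up state, node "y" finishes later on processor 0 than on processor 1 because the result of "b" has to cross between processors, which is the kind of effect the weights on the bipartite graph are meant to capture.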

Although these three heuristic algorithms are based on the combination of
the min_max matching and weighted matching algorithms applied at each
timestep, they are not a straight application of min_max matching followed by
weighted matching. A slight modification of the weights of the bipartite graph is
needed after a matching from the min_max algorithm is obtained. The reason is
that the weighted matching algorithm will find a nodes-to-processors assignment
for which the sum of the elapsed times of the processors is maximum. However,
what we would like to have is an assignment such that the sum of the elapsed
times of the processors is minimum. Hence the entries in the matrix (bipartite
graph) obtained from the min_max matching algorithm, which are the elapsed
times, are modified as follows:

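\[
\begin{aligned}
max\_ent &= \operatorname{Max}\{\, et_{ij} \,\} \\
et_{ij} &\leftarrow max\_ent \times \operatorname{Min}(p, n) - et_{ij}
\end{aligned}
\qquad (4.1)
\]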
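The weight modification can be sketched in a few lines of Python. The sketch reads equation (4.1) as et_ij <- max_ent x Min(p, n) - et_ij, with max_ent the largest entry of the matrix; this reading is an assumption, chosen because it reproduces the modified matrix of the worked example in section 4.5.1 and agrees with how max_ent is described in the program for heuristic E. The function name modify_weights is hypothetical, and the sketch assumes every entry of the matrix is still an edge of the graph.

```python
def modify_weights(et, p, n):
    """Turn 'small elapsed time is good' into 'large weight is good'."""
    max_ent = max(max(row) for row in et)   # largest elapsed time in the matrix
    k = min(p, n)                           # number of edges in a full assignment
    return [[max_ent * k - w for w in row] for row in et]


modified = modify_weights([[5, 13], [9, 13]], p=2, n=2)
print(modified)  # [[21, 13], [17, 13]]
```

One plausible reason for the factor Min(p, n), offered here only as an observation: when all elapsed times are positive, it makes every full assignment of Min(p, n) edges outweigh any assignment with fewer edges, so a maximum-weight matching on the modified graph is also a complete one.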

4.5.1. Simple Illustrative Examples

The following two examples illustrate how min_max and weighted matchings
can be applied to obtain the heuristic algorithms described in the next
three sections. Consider the following scheduling environment at a given
timestep. Suppose there are two processors p_1, p_2 available and there are two
nodes i_1, i_2 ready for assignment. Hence there are only two possible
assignments. One assignment will have node i_1 assigned to p_1 and node i_2
assigned to p_2. The other assignment will have node i_1 assigned to p_2 and node
i_2 assigned to p_1. Further suppose the elapsed times of the processors are as
shown in the following matrix, ET = [et_ij]:

\[
ET = \begin{bmatrix} 1 & 3 \\ 2 & 5 \end{bmatrix}
\]

The above matrix ET is a representation of the weighted bipartite graph shown in
Figure 4.8(a). When the min_max matching algorithm is applied to this weighted
bipartite graph (the detailed procedure of the algorithm can be found in [25]), the
matching X containing edge(p_1, i_2) and edge(p_2, i_1) shown in Figure 4.8(b) is
obtained. This represents the assignment of node i_1 to processor p_2 and node i_2
to processor p_1. This assignment has a shorter partial completion time of 3 units
(equal to Max{ et_12 = 3, et_21 = 2 }) as compared to the other assignment, which
has a partial completion time of 5 units (equal to Max{ et_11 = 1, et_22 = 5 }).
Hence the min_max matching algorithm will always give an assignment which
has the shortest partial completion time at a timestep.
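The first example can be checked with a small brute-force sketch in Python. It is not the threshold algorithm of [25]; it simply enumerates every way of giving each ready node its own processor and keeps the assignment whose largest elapsed time is smallest. The matrix entries are the values quoted above (et_11 = 1, et_12 = 3, et_21 = 2, et_22 = 5); the function name is hypothetical, and the sketch assumes the matrix has no more columns than rows.

```python
from itertools import permutations


def min_max_assignment(et):
    """Return (bottleneck, rows) where rows[j] is the processor given to node j."""
    p, k = len(et), len(et[0])
    best = None
    for rows in permutations(range(p), k):
        # Partial completion time of this assignment: its largest elapsed time.
        bottleneck = max(et[rows[j]][j] for j in range(k))
        if best is None or bottleneck < best[0]:
            best = (bottleneck, rows)
    return best


print(min_max_assignment([[1, 3], [2, 5]]))  # (3, (1, 0))
```

With zero-based indices, the result (1, 0) says that node i_1 goes to processor p_2 and node i_2 goes to processor p_1, with a partial completion time of 3 units. The enumeration grows factorially, which is why the text turns to matching algorithms instead.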

However,  it  is  sometimes  possible  to  have  more  than  one  assignment  with 
the  same  shortest  partial  completion  time.  Consider  the  following  scheduling 
environment  with  two  processors  pi,  p2  and  two  nodes  5,  8  ready  for  assignment 
at  timestep  equal  to  4.  Further  suppose  that  the  elapsed  times  of  the  processors 
are shown in the following matrix, ET = [et_ij]:

\[
ET = \begin{bmatrix} 5 & 13 \\ 9 & 13 \end{bmatrix}
\]

The corresponding weighted bipartite graph is shown in Figure 4.9(a). The two
possible schedules are shown in Figure 4.6. The min_max matching will result in
either the matching shown in Figure 4.9(b), which corresponds to schedule (I)
of Figure 4.6, or the matching shown in Figure 4.9(c), which corresponds to
schedule (II) of Figure 4.6. Note that both of these assignments have a partial
completion time of 13 units. As discussed in detail at the end of section 4.4, a
compact schedule (a schedule whose sum of the elapsed times of the processors
is smallest) is preferred because more nodes can be assigned at the later
timestep. Hence schedule (I) of Figure 4.6 is a better choice than schedule (II)
in the same figure. In schedule (I), the sum of the elapsed times of the two
processors is 18 units while in schedule (II), the sum of the elapsed times of the
two processors is 22 units.

The problem is how to obtain an assignment such that the partial completion
time at a certain timestep is minimum and, in case there is more than one
assignment with the same partial completion time, to choose the assignment whose
sum of the elapsed times of the processors is minimum. To obtain the minimum
partial completion time, we apply the min_max matching as explained in the first
example. To obtain the smallest sum of elapsed times of processors, weighted
matching is employed. Since weighted matching will find a matching such that
the sum of the weights (which are the elapsed times in this application) on the
edges is maximum, and the scheduling assignment requires the sum of the
weights to be minimum, the weights on the bipartite graph should be modified.
Essentially, we have to change the smaller weights to larger weights and the larger
weights to smaller weights on the bipartite graph. This can be accomplished by
modifying the entries of the matrix ET according to equation (4.1). Hence in this
example, the matrix of the modified graph will be as follows:


\[
ET = \begin{bmatrix} 21 & 13 \\ 17 & 13 \end{bmatrix}
\]


Applying the weighted matching to the modified bipartite graph, we will obtain
the matching shown in Figure 4.9(b), which corresponds to schedule (I) of
Figure 4.6.
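The two-stage choice in this second example can be sketched by brute force in Python. The sketch first keeps the assignments with the smallest partial completion time, then picks among them the one with the largest total modified weight, where each weight is max_ent x Min(p, n) minus the elapsed time. Enumeration stands in for the two matching algorithms here, and the function name best_assignment is hypothetical; the matrix is the one given above for nodes (5) and (8).

```python
from itertools import permutations


def best_assignment(et):
    """Min partial completion time first, then max total modified weight."""
    p, k = len(et), len(et[0])
    options = list(permutations(range(p), k))  # options[x][j]: processor of node j

    def bottleneck(rows):
        return max(et[rows[j]][j] for j in range(k))

    def total(rows):
        return sum(et[rows[j]][j] for j in range(k))

    shortest = min(bottleneck(rows) for rows in options)
    ties = [rows for rows in options if bottleneck(rows) == shortest]

    max_ent = max(max(row) for row in et)

    def modified_weight(rows):
        return sum(max_ent * k - et[rows[j]][j] for j in range(k))

    winner = max(ties, key=modified_weight)
    return winner, shortest, total(winner)


print(best_assignment([[5, 13], [9, 13]]))  # ((0, 1), 13, 18)
```

The winner (0, 1) places node (5) on p_1 and node (8) on p_2, which is schedule (I): partial completion time 13 and a sum of elapsed times of 18, against 22 for schedule (II).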

The heuristic algorithms E, F and EF are based on the application of these
two combinatorial algorithms. Nodes from the set of ready nodes are chosen at
each timestep and min_max and weighted matchings are employed. The
difference between these heuristics is the criterion used to select the
candidate nodes for assignment to processors.


4.6.  Heuristic  Algorithm  E 


In  this  heuristic,  the  candidate  nodes  are  the  nodes  with  the  highest  level 
taken  from  the  set  of  ready  nodes.  The  high  level  description  of  the  program  is 
given  below: 

timestep = 0 ;   /* p processors are available */

while ( there are unassigned nodes )
{
(1) Obtain a set of ready nodes.

(2) Count the number of ready nodes at this timestep, say it is equal to m.

(3) If ( m > p ) choose the p nodes with the highest level from the set of
    ready nodes
    Else choose all m nodes from the set.

(4) Form the matrix ( p by min(p, m) ) whose entries are the elapsed times
    of the processors.

(5) Find the maximum value of the entry in the matrix, say it is equal to
    max_ent.

(6) Find the minimum value of the entry in the matrix, say it is equal to
    min_ent.

(7) threshold = min_ent ;

(8) for ( ; ; )   /* loop for min_max */
    {
        find all edges whose weights <= threshold ;   /* we have a bipartite graph */
        find augmenting paths in the graph ;
        update the matching ;
        find the number of edges in the matching ;   /* say it is equal to num_edge */
        if ( num_edge == min ( p, m ) ) break ;   /* done with min_max */
        else threshold ++ ;
    }   /* end of loop for min_max */

(9) For this most current matrix ( bipartite graph ), update each entry as
    described in the last section.

(10) Apply the weighted matching algorithm to this matrix to obtain the final
     assignment.

(11) Increment the timestep.

}   /* end of while loop */
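The min_max loop in the program above can be sketched in Python as follows. This is a minimal sketch, not the procedure of [25]: it assumes the threshold may be stepped through the distinct weights in increasing order (instead of being incremented by one), it rebuilds the matching from scratch at each threshold (instead of updating it), and it uses a plain depth-first search for augmenting paths. The function names are hypothetical.

```python
def try_augment(proc, allowed, match_of_node, seen):
    """Search for an augmenting path that starts at processor `proc`."""
    for node in allowed[proc]:
        if node in seen:
            continue
        seen.add(node)
        owner = match_of_node.get(node)
        # Take a free node, or move its current owner to another node.
        if owner is None or try_augment(owner, allowed, match_of_node, seen):
            match_of_node[node] = proc
            return True
    return False


def threshold_min_max(et):
    """Raise the threshold until a matching covers every column of `et`."""
    p, k = len(et), len(et[0])
    for threshold in sorted({w for row in et for w in row}):
        # Keep only the edges whose weight does not exceed the threshold.
        allowed = [[j for j in range(k) if et[i][j] <= threshold] for i in range(p)]
        match_of_node = {}
        for proc in range(p):
            try_augment(proc, allowed, match_of_node, set())
        if len(match_of_node) == k:
            return threshold, match_of_node   # node index -> processor index
    return None


print(threshold_min_max([[1, 3], [2, 5]]))  # (3, {1: 0, 0: 1})
```

For the first example of section 4.5.1 the loop stops at threshold 3 with node i_1 on p_2 and node i_2 on p_1. For the second example it stops at threshold 13 but may return either schedule, which is why the weighted matching step follows.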

In  summary,  in  this  heuristic  algorithm  E,  only  the  nodes  with  the  highest 
level  are  the  candidates  to  be  scheduled  to  processors.  Hence  it  is  not  much 
different  from  the  level  scheduling  technique.  Also  in  the  case  where  the  number 
of  nodes  in  the  set  of  ready  nodes  at  a  timestep  is  larger  than  the  number  of 
processors  available,  we  only  choose  the  number  of  nodes  equal  to  the  number 
of  processors.  By  excluding  some  nodes  from  the  set  of  ready  nodes,  this  may 
lead  to  an  assignment  which  will  not  give  the  shortest  partial  completion  time  at 
this  timestep.  In  the  next  section,  we  will  discuss  another  heuristic  algorithm  F 
which  is  a  modification  of  the  heuristic  E. 

4.7.  Heuristic  Algorithm  F 

The modification to the previous heuristic algorithm is based on the idea
that we should consider all the nodes in the set of ready nodes, no matter how
many nodes there are in that set. Hence the concept of the level of a node is not
incorporated in this heuristic algorithm. This heuristic scheduling algorithm
considers all the nodes ready for assignment to all the available processors,
taking into account the communication delay between processors, and then it finds
an assignment which gives the smallest partial completion time at that
timestep. If the size of the set of ready nodes is larger than the number of
processors available, the candidate nodes are not necessarily restricted to the
nodes with the highest level. If the size of the set of ready nodes is smaller than
the number of processors available at each timestep, then heuristic F is
identical to heuristic E. The heuristic algorithm F is described below:

timestep = 0 ;   /* p processors are available */

while ( there are unassigned nodes )
{
(1) Obtain a set of ready nodes.

(2) Count the number of nodes in this set, say it is equal to m.

(3) Form the matrix ( bipartite graph ) of p by m.

(4) Apply the min_max matching algorithm.

(5) Apply weighted matching on the modified bipartite graph as discussed
    before to obtain the final assignment.

(6) Increment the timestep.

}   /* end of while */
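The only difference between the two programs is how the candidate nodes are picked, which a short Python sketch can make concrete. The function names, the level table and the node numbers below are hypothetical and serve only as an illustration.

```python
def candidates_e(ready, level, p):
    """Heuristic E: at most p ready nodes, highest level first."""
    ranked = sorted(ready, key=lambda node: level[node], reverse=True)
    return ranked[:p]


def candidates_f(ready, level, p):
    """Heuristic F: every ready node is a candidate; the level is ignored."""
    return list(ready)


ready = [3, 4, 7]
level = {3: 5, 4: 2, 7: 4}
print(candidates_e(ready, level, p=2))  # [3, 7]
print(candidates_f(ready, level, p=2))  # [3, 4, 7]
```

When the number of ready nodes does not exceed p, both functions return the same set of nodes, in line with the remark that F is then identical to E.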

Besides the motivation of considering all the nodes in the set of ready nodes
so that all possible assignments are taken into account, the other reason is that
we will get a more "compact" schedule in which processors will have less idle
time. In this case, nodes with lower level may get scheduled ahead of those
nodes with higher level. One might think that delaying the execution of nodes
with higher level will increase the final completion time of the task graph. This is
true in the case where there is no communication delay or when the delay is
small compared to the node execution time. In the situation where a large delay
exists between communicating processors, if we only pick nodes with high level
as the candidate nodes, they may have large node completion times, resulting in
long idle times in the processors. It may be beneficial to schedule nodes with
smaller node completion times first at this timestep to "fill" in those idle times
even though these nodes are of lower level.

In summary, this heuristic is based solely on the idea that the shortest partial
completion time of the processors is the primary concern, without regard to the
level of a node. It is aimed at handling the situation where a delay that is large
compared to the node execution time exists between communicating processors. In
the next section, another heuristic algorithm EF will be discussed. It represents
another possible approach between these two extremes, in which the candidate
nodes must have high level (heuristic E) and the candidate nodes should have
short node completion times (heuristic F).

4.8.  Heuristic  Algorithm  EF 

In the previous two sections, two different concepts are employed in two
different local heuristic scheduling algorithms. One is based on the concept of
levels of ready nodes, the other is based on the concept of the completion times
of the ready nodes. In this heuristic EF, the candidate nodes for assignment to
processors are based on the weighted sum of their levels and their node
completion times. To be more specific, let β be a tuning parameter which has a value
between zero and one, let lvl_i be the level of a given node i and let ct_ij be the
completion time of node i when it is assigned to processor j. Then the entry
(i, j) in the matrix (i.e. the weight on the bipartite graph) is assigned the
following value:

IvL 

where max_lvl is the maximum level in the task graph. The program flow is
identical to heuristic F except that the entries in the matrix are computed
differently.

In summary, this heuristic gives an alternative way of scheduling nodes in
the presence of communication delay. It is expected on the average to perform
better than the previous two heuristics E and F. However, we cannot say that this
is always the case because of the heuristic nature of these algorithms. In
concluding the description of these four heuristic algorithms, an important question
is the order of complexity of these heuristics. This will be discussed in the next
section.

4.9. Complexity of these Heuristic Algorithms

The first heuristic D described in this chapter is different from the remaining
three heuristics in the sense that no combinatorial optimization technique is
employed in heuristic D. Recall that in this heuristic, one node at a time is
selected from the set of ready nodes and assigned to a processor until all the
nodes in the set are scheduled at this timestep. Then a new set of ready nodes is
formed at the next timestep. The above process is repeated until all the nodes in
the task graph are assigned. It is clear that the complexity of this simple
heuristic algorithm is linear in the size of the task graph, that is, O(n), where n
is the number of nodes in the graph.

The other three heuristics E, F and EF are based on the application of
min_max and weighted matching algorithms at each timestep. Let |S| denote
the number of nodes in the set S of a bipartite graph. In [25], it is shown that if
|S| = m and |T| = n, the complexity of the threshold method for min_max
matching is O(m^2 n^2) while for the weighted matching, the complexity is
O(m^3 n). Suppose a given task graph has n nodes and there are p processors
available. It requires O(n/p) timesteps to schedule all n nodes in the task graph
to the processors. At each timestep, the size of the set of ready nodes is
bounded by n. Hence the order of complexity of performing min_max and
weighted matchings is O(n^2 p^2 + n^3 p). Since it takes on the order of (n/p)
timesteps to schedule all the nodes, the complexity will be
O((n/p) x (n^2 p^2 + n^3 p)) which is


4.10. Simulation Results and Discussion


Six sparse matrices extracted from six benchmark circuits in SPICE [1] are
used to test the performance of the four heuristic algorithms. The performance
is measured as the percentage of improvement in speedup ratio over Hu's level
scheduling technique when there is a constant delay between communicating
processors.


\[
\%\ \text{of improvement} =
\frac{\text{speedup ratio (Heuristic)} - \text{speedup ratio (Hu)}}
     {\text{speedup ratio (Hu)}} \times 100\%
\]

These matrices are of different sizes, ranging from a 12 by 12 matrix to a 32 by
32 matrix. The number of operations necessary for the LU decomposition (the
number of nodes in the LU task graph) ranges from 58 (for the 12 by 12
matrix) to 767 (for the 32 by 32 matrix). The number of processors and the
delays are the two parameters used in testing the performance of the
heuristics. In addition, the tuning factor β described in heuristic EF is also a
parameter in generating a family of curves. For the delay, we assume that each
operation takes one unit of execution time and the delays are integral multiples of
the node execution time. Delays of 1, 5, 10, and 20 time units are used. Notice
that these tests are performed with the understanding that the results are by no
means representative. How well these heuristic algorithms perform depends on
the structure of the task graph, the number of processors available and the
value of the delay. In other words, these algorithms might give better results for
one graph than for another. However, we expect that these heuristics, in general,
will give a better speedup ratio than Hu's level scheduling algorithm when there is
communication delay between processors exchanging data.
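The performance measure defined at the start of this section is a one-line calculation, sketched here in Python. The function name and the two speedup ratios in the example call are hypothetical and are not taken from the simulation results.

```python
def percent_improvement(speedup_heuristic, speedup_hu):
    """Percentage of improvement in speedup ratio over Hu's level scheduling."""
    return 100.0 * (speedup_heuristic - speedup_hu) / speedup_hu


# Hypothetical speedup ratios, for illustration only.
print(percent_improvement(3.0, 2.0))  # 50.0
```

A negative value means the heuristic produced a longer schedule than Hu's technique for that graph, delay and number of processors.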


The simulation results can be discussed in two categories. In one category,
the performance of each heuristic with different delays is discussed. The
performance improvements for heuristic D applied to the six graphs are shown in
Figure 4.10 and Figure 4.11. The number of operations (nodes) for each graph is
shown on each figure. The results for heuristics E, F and EF are shown in Figures
4.12 and 4.13, Figures 4.14 and 4.15, and Figures 4.16 and 4.17 respectively. For
heuristics D, E and F, in most of the cases, it is observed that as the delay
increases, the percentage of improvement over Hu increases for a given number of
processors. For heuristic EF, the same general trend is observed. However, there
are cases in which the delay increases and the performance decreases for a
certain value of β. As discussed above, there is no definitive conclusion which
says that for a given set of parameters such as β, the delay and the number of
processors, we will achieve a certain speedup ratio. But in nearly all of the
simulation results, these heuristic algorithms do indeed tend to give better
schedules than Hu's level scheduling technique. On the average, heuristic D gives
5% to 15% of improvement, heuristic E gives 10% to 25%, and heuristic F gives 20%
to 35%, with some cases in which it gives 40% to 50% of improvement. Heuristic EF
gives on the average 50% to 80% for some β. Sometimes, for a certain value of β,
we achieve over 100% of improvement. A final remark with respect to this category
is that none of these heuristics, especially heuristic F, gives much improvement
for small delay. In some cases, F gives schedules which have a longer final
completion time than Hu's. This may lead to the conclusion that Hu's level
scheduling algorithm is probably one of the appropriate algorithms to use when
there is small delay or no delay at all.

In the other category, these four heuristics are compared with each other
for a given delay. For heuristic EF, only three values of β (0.1, 0.5, 0.9) are
chosen to compare with the other three heuristics. The performance comparison
of these four heuristics for the graph with 58 nodes with different delays
is shown in Figure 4.18 and Figure 4.19. The results for the graphs with 128
nodes, 295 nodes, 644 nodes, 725 nodes and 767 nodes for different delays are
shown in Figures 4.20 and 4.21, Figures 4.22 and 4.23, Figures 4.24 and 4.25,
Figures 4.26 and 4.27, and Figures 4.28 and 4.29 respectively. In general, F
performs better than E, which in turn performs better than D, for large delay. As
discussed in the last paragraph, in the case of small delay, F performs worse
than E and D. This shows that for small delay, the level of a node should be used
as a criterion for selecting candidate nodes, while for large delay, the node
completion time is the primary consideration in choosing the nodes. The heuristic
EF is developed based on the above observation that it may be advantageous to
consider the weighted sum of the level of a node and its node completion time. In
fact, the results show that there are some values of the tuning parameter β
which give quite impressive speedup performance.

4.11.  Conclusion 

In this chapter, four local heuristic scheduling algorithms are discussed.
These heuristics are developed to assign nodes in an LU task graph to processors
when there is a fixed communication delay between processors exchanging data.
The heuristic D assigns nodes one at a time from the set of ready nodes at each
timestep. The other three heuristics perform combinatorial matching between
nodes in the ready-node set and the processors to obtain an assignment at each
timestep. In most of the cases, promising results are obtained, with performance
measured as the percentage of improvement in speedup ratio compared to Hu's level
scheduling technique. Note that we are trying to achieve the shortest
completion time of the task graph heuristically by minimizing the partial
completion times of the processors at each timestep. In the next chapter,
another technique is attempted which tries to minimize the completion time by
considering one aspect of the structure of a given task graph, namely the
critical path of that graph. By taking into consideration all the nodes in the task
graph, it is expected that this global technique gives a shorter completion time
than the techniques described in this chapter.


Processors 


Figure  4.1 


(a)  Fragment  Of  An  Example  Graph,  (b)  Schedule  Showing  The 
Assignment  Of  Nodes  ij,  if,  tj  To  Processors  Pi.Pi.Ps  At 
fimestep  £* 


(a)  An  Augmenting  Path  For  The  Matching  In  The  Bipartite 
Graph  Of  Figure  4.2(a).  (b)  Excluding  adgt(l,4)  From  And 
Including  adga(3.4)  And  adga(l,5)  Into  The  Matching. 


Figure  4.4 


The  New  Matching  After  Finding  The  Augmenting  Path  Of  {  (3) 
(4).  (1).  (5)  J  In  The  Matching  Shown  In  Figure  4.2(b). 
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Figure  4.S  Dependency  Of  The  Schedule  Obtained  On  The  Order  In  Which 
Nodes  Are  Selected,  (a)  Node  (3)  Is  Selected  First,  (b)  Node  (4) 
Is  Selected  First. 
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(b) 


8  Two  Pouible  Schedules  With  Same  Partial  Completion  Time. 
Nodes  Can  Be  Assigned  To  Processors  Earlier  in  The  Later 
■p  me  step  After  Node  (5)  in  (a)  Than  in  (b). 


Figure  4.8 


(a)  Weighted  Bipartite  Graph  Representation  Of  The  Schedu 
Environment  Of  Two  Processors  pjt  p*  and  Two  Nodes  i,.  ig. 
The  Matching  Obtained  By  MinJfax  Matching  Algorithm. 


figure  4.9  (•)  Weighted  Bipartite  Craph  Representation  Of  The  Scheduling 

Environment  Of  Two  Processors  Pi.Pt  And  Two  Nodes  (5).  (B). 
The  WiruWax  Watching  Gives  Either  (b)  Or  (c).  Weighted  Watch¬ 
ing  Applied  To  The  WodiAed  Weights  On  The  Graph  Of  (a)  Will 
Result  In  (b)  Only. 
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Figure  4.10  Percentage  Of  Speedup  Improvement  of  Heuristic  D  Over  Hu's 
Level  Scheduling  With  Different  Constant  Delays  For  The  Follow¬ 
ing  Task  Graphs,  (a)  56  Nodes  (b)  128  Nodes  (c)  295  Nodes 
Figure 4.11 Percentage Of Speedup Improvement of Heuristic D Over Hu's
Level Scheduling With Different Constant Delays For The Following
Task Graphs. (a) 644 Nodes (b) 725 Nodes (c) 767 Nodes

Figure 4.15 Percentage Of Speedup Improvement of Heuristic F Over Hu's
Level Scheduling With Different Constant Delays For The Following
Task Graphs. (a) 644 Nodes (b) 725 Nodes (c) 767 Nodes

Figure 4.16 Percentage Of Speedup Improvement of Heuristic EF (β = 0.5)
Over Hu's Level Scheduling With Different Constant Delays For
The Following Task Graphs. (a) 56 Nodes (b) 126 Nodes (c) 295
Nodes

Comparison of The Speedup Performance Of Heuristics D, E,
EF (β = 0.1, 0.5, 0.9) For Task Graph With 58 Nodes (a) Delay
Of 10 Units, (b) Delay Of 20 Units.

Figure 4.21 Comparison of The Speedup Performance Of Heuristics D, E, F,
EF (β = 0.1, 0.5, 0.9) For Task Graph With 128 Nodes (a) Delay
Of 10 Units, (b) Delay Of 20 Units.

Figure 4.22 Comparison of The Speedup Performance Of Heuristics D, E, F,
EF (β = 0.1, 0.5, 0.9) For Task Graph With 295 Nodes (a) Delay
Of 1 Unit, (b) Delay Of 5 Units.

Figure 4.24 Comparison of The Speedup Performance Of Heuristics D, E, F,
EF (β = 0.1, 0.5, 0.9) For Task Graph With 644 Nodes (a) Delay
Of 1 Unit, (b) Delay Of 5 Units.

Figure 4.25 Comparison of The Speedup Performance Of Heuristics D, E, F,
EF (β = 0.1, 0.5, 0.9) For Task Graph With 644 Nodes (a) Delay
Of 10 Units, (b) Delay Of 20 Units.

Figure 4.27 Comparison of The Speedup Performance Of Heuristics D, E, F,
EF (β = 0.1, 0.5, 0.9) For Task Graph With 725 Nodes (a) Delay
Of 10 Units, (b) Delay Of 20 Units.

Figure 4.29 Comparison of The Speedup Performance Of Heuristics D, E, F,
EF (β = 0.1, 0.5, 0.9) For Task Graph With 767 Nodes (a) Delay
Of 10 Units, (b) Delay Of 20 Units.

CHAPTER 5


HEURISTIC  GLOBAL  CLUSTERING  TECHNIQUE 

5.1.  Introduction 

In  this  chapter,  another  heuristic  technique  is  described  to  minimize  the 
completion  time  of  a  given  LU  task  graph  with  constant  communication  delay  on 
each  edge.  This  technique,  which  we  call  global  clustering,  clusters  nodes 
together  to  eliminate  the  communication  delay  among  these  nodes.  By  doing  so, 
the  time  required  to  complete  the  whole  task  graph  is  shortened.  This  heuristic 
is  different  from  those  described  in  the  previous  chapter.  Those  local  heuristic 
algorithms  try  to  minimize  the  completion  time  at  each  time  step  by  performing 
combinatorial  mapping  between  a  set  of  ready  tasks  at  that  particular  time  step 
and  the  processors.  In  this  global  heuristic  approach,  the  algorithm  is  applied 
iteratively  on  the  whole  directed  graph.  The  performance  of  the  local  and  global 
algorithms  in  terms  of  the  speedup  ratio  achieved  will  be  compared  at  the  end 
of  this  chapter. 

The heuristic global clustering technique involves two steps: finding the
critical (longest) path in a directed graph, and clustering the nodes.
These two steps are described in more detail in the rest of the chapter. Simple
examples are used to illustrate the concepts.

5.2.  Finding  the  Critical  Path  in  a  Directed  Graph 

Before  we  go  into  the  details  of  the  algorithm  for  finding  the  critical  path,  a 
few  definitions  are  necessary. 

A directed graph G = (N, E, <) is a graph which consists of a set of nodes and
a set of edges. Each node has a number for its identification and each edge has a
non-negative number $w_{ij}$ associated with it, which we call its weight. It
represents the communication delay between the two nodes connected by this
edge. Each node also has its own weight $\tau_i$ representing the node execution
time. The precedence constraint between a pair of nodes is denoted by i < j,
which means that node i is the immediate predecessor of node j.

The length of a path is defined as the sum of the weights of the edges and
the weights of the nodes which constitute that path.

The critical path is the longest path from the initial node to the terminal
node.

The algorithm for finding the critical path in a directed graph is called the
Modified Cascade Algorithm. The idea comes from the original Cascade Algorithm
for finding the shortest path in a directed graph. A few modifications are
made to the original Cascade Algorithm in order to search for the longest path.
Hence a careful understanding of the methods for finding the shortest path, of
which there are quite a few, is necessary. In most of the methods, a matrix
called the distance matrix D associated with the graph is formed. The elements
$d_{ij}$ of the distance matrix are defined as

$$d_{ij} = \begin{cases} w_{ij} & \text{if } i < j,\ i \neq j; \\ 0 & \text{if } i = j; \\ \infty & \text{otherwise.} \end{cases}$$

In the case of an undirected graph, the distance matrix is symmetric while for a
directed acyclic graph, the distance matrix is asymmetric.

For completeness, we will mention another method available for finding the
shortest path in which the distance matrix is not necessary. It is based on the
solution of a system of nonlinear equations obtained from the theory of dynamic
programming[28].


5.2.1. Dynamic Programming Approach

Given a graph with N nodes, a system of equations is formed in which the
$f_i$ are the unknown quantities:

$$f_i = \min_{j \neq i} \{ d_{ij} + f_j \}, \qquad i = 1, 2, \ldots, N-1 \tag{5.1}$$

$$f_N = 0$$

where $f_i$ = shortest distance from the node i to the destination node N.

The minimization is taken over all j which can be reached directly from i. The
method of successive approximations is used to solve the above system according
to the following steps:

(1) Pick an initial guess $f_i^{(1)}$ (i = 1, 2, ..., N-1) and choose $f_N^{(1)} = 0$.

(2) Compute the next approximation $f_i^{(k)}$ using the recursive relations:

$$f_i^{(k)} = \min_{j \neq i} \{ d_{ij} + f_j^{(k-1)} \}$$

$$f_N^{(k)} = 0, \qquad k = 2, 3, \ldots$$

After solving for all the $f_i$'s, it is not difficult to deduce the shortest
path. It has been proved in [1] that the method of successive approximations
converges to a unique solution.
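The successive approximation steps above can be sketched in C as follows. This is only an illustration under stated assumptions: nodes are numbered from 0, the destination is passed as `dest` instead of being fixed at node N, the initial guess is taken to be the direct edge to the destination, and the names `shortest_to_destination` and `INF` are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define N 5                 /* number of nodes in this illustration */
#define INF (INT_MAX / 2)   /* "no edge"; halved so INF + weight cannot overflow */

/* Solve f[i] = min over j != i of (d[i][j] + f[j]) with f[dest] = 0 by
 * successive approximations. Edge weights are assumed non-negative.
 * Returns the number of sweeps needed until nothing changes. */
int shortest_to_destination(int d[N][N], int dest, int f[N])
{
    int next[N];
    int sweeps = 0;
    int changed = 1;

    assert(dest >= 0 && dest < N);

    /* Assumed initial guess f^(1): the direct edge to the destination. */
    for (int i = 0; i < N; i++)
        f[i] = d[i][dest];
    f[dest] = 0;

    while (changed) {
        changed = 0;
        for (int i = 0; i < N; i++) {
            int best = INF;
            if (i == dest) {
                next[i] = 0;
                continue;
            }
            for (int j = 0; j < N; j++) {
                if (j == i || d[i][j] >= INF || f[j] >= INF)
                    continue;   /* j not directly reachable, or j cannot reach dest */
                if (d[i][j] + f[j] < best)
                    best = d[i][j] + f[j];
            }
            next[i] = best;
            if (best != f[i])
                changed = 1;
        }
        /* f^(k) replaces f^(k-1) only after the whole sweep is finished. */
        for (int i = 0; i < N; i++)
            f[i] = next[i];
        sweeps++;
    }
    return sweeps;
}
```

Each sweep uses only values from the previous approximation, which matches the recursive relation in step (2); the loop stops at the first sweep that changes nothing.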

5.2.2. The Cascade Algorithm

This algorithm is based on the method by Farbey, Land and Murchland[27].
A brief description of the algorithm will be given, followed by a simple example.
In the next section, the Modified Cascade Algorithm for finding the critical path
in the LU task graph, based on modifications of the original Cascade Algorithm,
will be discussed.

First of all, let us define an elementary multiplication of two matrices,
which is different from the ordinary matrix multiplication. The product C of two
matrices A and B is defined as

$$c_{ij} = \min_k ( a_{ik} + b_{kj} ), \qquad k = 1, 2, \ldots, N \tag{5.2}$$

The Cascade Algorithm is essentially two successive squarings of the original
distance matrix associated with the graph, with the multiplication of two
matrices defined as above. Two things have to be kept in mind when carrying out
this algorithm:

(1) Elements in the matrix must be updated as soon as they are calculated.

(2) Elements must be calculated in the order $d_{11}, \ldots, d_{1N};\ d_{21}, \ldots, d_{2N};\ \ldots;\ d_{N1}, \ldots, d_{NN}$
on the first squaring (forward process) and in the order
$d_{NN}, \ldots, d_{N1};\ \ldots;\ d_{1N}, \ldots, d_{11}$ on the second squaring
(backward process).

The resultant matrix S is the shortest distance matrix between any pair of
nodes in the graph.

In many applications, the shortest path itself is required, i.e. the
intermediate nodes which constitute the shortest path in going from node i to
node j have to be known. This will be easily available if we make suitable
records of the nodes during the course of the shortest distance calculation. For
the purpose of keeping track of the shortest path, a routing matrix R is used.
The $r_{ij}$ element of R contains the number of the first node on a shortest path
from node i to node j. The elements in R are changed as the algorithm
proceeds. Initially, $r_{ij}$ is set equal to j for all i and j. During the shortest
distance calculation, whenever we have to change an element in D, the
corresponding element in R has to be updated also, i.e. whenever

$$d_{ij} \leftarrow d_{ik} + d_{kj}, \qquad r_{ij} \leftarrow r_{ik}.$$

If we are given a fully connected undirected graph of N nodes, we will have
an N by N full distance matrix. If we call the addition of two elements in D and
the comparison to obtain the minimum one operation, the Cascade algorithm
requires $2N^3$ such operations. Hence it is a polynomial time algorithm.


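The following example will illustrate how the Cascade algorithm works. Suppose
a directed graph and the weights of the edges are given in the following
figure:

The original distance matrix D and the routing matrix R are

$$D = \begin{pmatrix}
0 & 3 & 4 & \infty & 7 \\
\infty & 0 & \infty & 1 & \infty \\
\infty & \infty & 0 & 10 & 2 \\
\infty & \infty & \infty & 0 & \infty \\
\infty & 4 & \infty & \infty & 0
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5
\end{pmatrix}$$

The matrix D and the matrix R after one step of the Cascade Algorithm are

$$D = \begin{pmatrix}
0 & 3 & 4 & 4 & 6 \\
\infty & 0 & \infty & 1 & \infty \\
\infty & \infty & 0 & 10 & 2 \\
\infty & \infty & \infty & 0 & \infty \\
\infty & 4 & \infty & \infty & 0
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1 & 2 & 3 & 2 & 3 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5
\end{pmatrix}$$

The final matrix S and the matrix R are

$$S = \begin{pmatrix}
0 & 3 & 4 & 4 & 6 \\
\infty & 0 & \infty & 1 & \infty \\
\infty & 6 & 0 & 7 & 2 \\
\infty & \infty & \infty & 0 & \infty \\
\infty & 4 & \infty & 5 & 0
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1 & 2 & 3 & 2 & 3 \\
1 & 2 & 3 & 4 & 5 \\
1 & 5 & 3 & 5 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 2 & 5
\end{pmatrix}$$

It can easily be verified for the above example that the entries in the final
distance matrix S give the shortest distance between a pair of nodes in the
directed graph and the routing matrix R gives the shortest path. For example,
the shortest distance from node 3 to node 4 is 7 and the shortest path is from
node 3 to node 5 ($r_{34}$) to node 2 ($r_{54}$) to node 4.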
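A minimal C sketch of the forward and backward processes with the routing update is given here as an illustration, not as the original implementation. It assumes 0-based node numbers (so the text's node 3 is index 2), a dense matrix with a hypothetical `INF` value for missing edges, and a zero diagonal; the function names are also assumptions.

```c
#include <assert.h>
#include <limits.h>

#define N 5
#define INF (INT_MAX / 2)   /* "no edge"; halved so two entries can be added safely */

/* Recompute d[i][j] as the minimum over k of d[i][k] + d[k][j]. The result is
 * written back at once, so elements computed later already see it. */
static void relax(int d[N][N], int r[N][N], int i, int j)
{
    for (int k = 0; k < N; k++) {
        if (d[i][k] >= INF || d[k][j] >= INF)
            continue;
        if (d[i][k] + d[k][j] < d[i][j]) {
            d[i][j] = d[i][k] + d[k][j];
            r[i][j] = r[i][k];   /* first node towards j = first node towards k */
        }
    }
}

/* On return d holds the shortest distances and r[i][j] the first node on a
 * shortest path from i to j. */
void cascade(int d[N][N], int r[N][N])
{
    for (int i = 0; i < N; i++) {
        assert(d[i][i] == 0);
        for (int j = 0; j < N; j++)
            r[i][j] = j;
    }

    for (int i = 0; i < N; i++)            /* forward process */
        for (int j = 0; j < N; j++)
            relax(d, r, i, j);

    for (int i = N - 1; i >= 0; i--)       /* backward process */
        for (int j = N - 1; j >= 0; j--)
            relax(d, r, i, j);
}
```

Following `r[i][j]`, then `r[r[i][j]][j]`, and so on until the value equals `j` recovers the path itself, as done by hand in the example above.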

5.3. Modification Of The Cascade Algorithm To Find Critical Paths

There are basically two changes which have to be made to the Cascade algorithm
in order to find the longest distance between any pair of nodes. The first
modification, which is quite obvious, is in calculating the square of the
distance matrix. In the definition of the product C of two matrices A and B, we
change the minimum to a maximum:

$$c_{ij} = \max_k ( a_{ik} + b_{kj} ), \qquad k = 1, 2, \ldots, N \tag{5.3}$$

The second modification to the algorithm is on the entries of the distance
matrix. Instead of assigning the entry $d_{ij}$ the value of infinity ($\infty$) if
there is no edge from node i to node j, we just leave that entry empty. During
the forward and backward processes, we will not consider that particular entry.
The update procedure for the routing matrix stays the same.

The complexity of the modified algorithm is the same as the unmodified
algorithm for a fully connected undirected graph. However, in most real
applications, the graph is not fully connected and in some cases, it is directed.
This implies that the distance matrix D is very sparse. Hence for efficient
manipulation of matrix D in the Cascade algorithm and for efficient storage of a
graph with a very large number of nodes, a special data structure for the
distance matrix and the routing matrix is necessary. That will be the subject of
the next section.
5.3.1. Data Structure For Matrices D and R

The basic operation in the Cascade algorithm is the maximization of the
sums of the elements in a certain row and the corresponding elements in the
column in D. Hence nonzero entries in a row and a column should be linked
together. A sensible choice is a uni-directional linked list across the rows and
down the columns. With this simple data structure, only nontrivial operations
are performed and the storage requirement will be much less even for a graph
with a very large number of nodes. A similar data structure can be used for the
routing matrix R. However we only link up those elements whose values have
been updated. The other elements will have the same value as the corresponding
column index.

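We use the previous example to show how it works. Entries for which there is
no edge are left empty.

First the original distance matrix D and the routing matrix R are

$$D = \begin{pmatrix}
0 & 3 & 4 &  & 7 \\
  & 0 &  & 1 &  \\
  &  & 0 & 10 & 2 \\
  &  &  & 0 &  \\
  & 4 &  &  & 0
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5
\end{pmatrix}$$

The matrix D and the matrix R after one step of the modified Cascade algorithm
are

$$D = \begin{pmatrix}
0 & 11 & 4 & 14 & 7 \\
  & 0 &  & 1 &  \\
  &  & 0 & 10 & 2 \\
  &  &  & 0 &  \\
  & 4 &  &  & 0
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1 & 5 & 3 & 3 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5
\end{pmatrix}$$

The final matrix S and the matrix R are

$$S = \begin{pmatrix}
0 & 11 & 4 & 14 & 7 \\
  & 0 &  & 1 &  \\
  & 6 & 0 & 10 & 2 \\
  &  &  & 0 &  \\
  & 4 &  & 5 & 0
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1 & 5 & 3 & 3 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 5 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 2 & 5
\end{pmatrix}$$

From the above S matrix, the length of the critical path is 14, and it goes
from node 1 to node 3 ($r_{14}$) to node 4.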
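The two modifications can be sketched in C as below. This is an illustrative dense-matrix version under assumptions, not the linked-list structure described above: a hypothetical marker `EMPTY` stands for an entry that is left empty, node numbers are 0-based, edge weights are non-negative, and the graph is assumed acyclic.

```c
#include <assert.h>

#define N 5
#define EMPTY (-1)   /* hypothetical marker: no edge, entry is skipped */

/* Recompute d[i][j] as the maximum over k of d[i][k] + d[k][j], ignoring
 * empty entries. EMPTY is negative, so any real sum replaces it. */
static void relax_max(int d[N][N], int r[N][N], int i, int j)
{
    for (int k = 0; k < N; k++) {
        if (d[i][k] == EMPTY || d[k][j] == EMPTY)
            continue;
        if (d[i][k] + d[k][j] > d[i][j]) {
            d[i][j] = d[i][k] + d[k][j];
            r[i][j] = r[i][k];   /* routing update is unchanged */
        }
    }
}

/* Longest distances between all pairs of an acyclic graph; r[i][j] is the
 * first node on a longest path from i to j. */
void modified_cascade(int d[N][N], int r[N][N])
{
    for (int i = 0; i < N; i++) {
        assert(d[i][i] == 0);
        for (int j = 0; j < N; j++)
            r[i][j] = j;
    }

    for (int i = 0; i < N; i++)            /* forward process */
        for (int j = 0; j < N; j++)
            relax_max(d, r, i, j);

    for (int i = N - 1; i >= 0; i--)       /* backward process */
        for (int j = N - 1; j >= 0; j--)
            relax_max(d, r, i, j);
}
```

A sparse implementation would visit only the linked nonempty entries of row i and column j instead of scanning all k, which is where the saving described in this section comes from.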

5.4.  Application  of  Modified  Cascade  Algorithm  to  LU  Task  Graph 

The above modified Cascade Algorithm cannot be applied directly to the LU
task graph to find the critical path. The reason is that this algorithm finds the
critical path in a directed graph without taking into account the weights on the
nodes along the critical path. In other words, the algorithm chooses the longest
path in a graph whose nodes have zero weights (the edges are still weighted).
Recall that the nodes in the LU graph have weights which are the execution
times. Hence some modifications to the distance matrix are necessary in order
to use this Cascade algorithm. The simple idea of just adding the execution
times of the starting and terminating nodes to the weight of the edge connecting
them will not work: the forward and backward processes add entries together, so
the execution times of the intermediate nodes lying on a path would be counted
more than once. That leads us to the idea of adding only the execution time of
the starting node to the weight of the edge, then performing the forward and
backward processes of the Cascade algorithm on the modified distance matrix and
updating the routing matrix as usual. The distance matrix obtained gives the
longest distance between any pair of nodes without taking into account the
execution time of the terminating node of each path. Hence we have to add to
each nonzero entry in the distance matrix the execution time of the
corresponding terminating node. Based on the nonzero entries in the final
distance matrix, we can find the critical path in the graph.

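Perhaps it is a good idea to use our previous example to illustrate the above
explanation. Let us use the notation $\tau_i$ to denote the execution time of
node i and suppose we have the following execution times for the nodes in the
graph:

$\tau_1 = 2$;
$\tau_2 = 1$;
$\tau_3 = 10$;
$\tau_4 = 1$;
$\tau_5 = 3$;

The original distance matrix D and the routing matrix R are

$$D = \begin{pmatrix}
0 & 3 & 4 &  & 7 \\
  & 0 &  & 1 &  \\
  &  & 0 & 10 & 2 \\
  &  &  & 0 &  \\
  & 4 &  &  & 0
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5
\end{pmatrix}$$

The distance matrix D after the addition of execution times of the starting nodes
to the nonzero entries and the routing matrix R are

$$D = \begin{pmatrix}
0 & 5 & 6 &  & 9 \\
  & 0 &  & 2 &  \\
  &  & 0 & 20 & 12 \\
  &  &  & 0 &  \\
  & 7 &  &  & 0
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 4 & 5
\end{pmatrix}$$

The distance matrix D after the forward process and the routing matrix R are

$$D = \begin{pmatrix}
0 & 16 & 6 & 26 & 18 \\
  & 0 &  & 2 &  \\
  & 19 & 0 & 21 & 12 \\
  &  &  & 0 &  \\
  & 7 &  & 9 & 0
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1 & 5 & 3 & 3 & 3 \\
1 & 2 & 3 & 4 & 5 \\
1 & 5 & 3 & 5 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 2 & 5
\end{pmatrix}$$

The distance matrix D after the backward process and the routing matrix R are

$$D = \begin{pmatrix}
0 & 25 & 6 & 27 & 18 \\
  & 0 &  & 2 &  \\
  & 19 & 0 & 21 & 12 \\
  &  &  & 0 &  \\
  & 7 &  & 9 & 0
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1 & 3 & 3 & 3 & 3 \\
1 & 2 & 3 & 4 & 5 \\
1 & 5 & 3 & 5 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 2 & 5
\end{pmatrix}$$

The final distance matrix S after the addition of execution times of the
terminating nodes to the nonzero entries and the routing matrix R are

$$S = \begin{pmatrix}
0 & 26 & 16 & 28 & 21 \\
  & 0 &  & 3 &  \\
  & 20 & 0 & 22 & 15 \\
  &  &  & 0 &  \\
  & 8 &  & 10 & 0
\end{pmatrix}, \qquad
R = \begin{pmatrix}
1 & 3 & 3 & 3 & 3 \\
1 & 2 & 3 & 4 & 5 \\
1 & 5 & 3 & 5 & 5 \\
1 & 2 & 3 & 4 & 5 \\
1 & 2 & 3 & 2 & 5
\end{pmatrix}$$

From the final distance matrix S, the length of the critical path is 28. From
the routing matrix R, the critical path goes from node 1 to node 3 ($r_{14}$) to
node 5 ($r_{34}$) to node 2 ($r_{54}$) to node 4 ($r_{24}$).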
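The two adjustments that surround the forward and backward processes can be sketched in C as follows. This is a hedged illustration: the function names and the `EMPTY` marker are assumptions, nodes are numbered from 0, and the forward and backward processes themselves are not repeated here.

```c
#include <assert.h>

#define N 5
#define EMPTY (-1)   /* hypothetical marker: no edge / no path */

/* Before the forward process: add the execution time of the starting node
 * to the weight of every edge (i, j). */
void add_start_times(int d[N][N], const int tau[N])
{
    for (int i = 0; i < N; i++) {
        assert(tau[i] >= 0);
        for (int j = 0; j < N; j++)
            if (i != j && d[i][j] != EMPTY)
                d[i][j] += tau[i];
    }
}

/* After the backward process: add the execution time of the terminating
 * node to every off-diagonal entry that is not empty. */
void add_end_times(int d[N][N], const int tau[N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (i != j && d[i][j] != EMPTY)
                d[i][j] += tau[j];
}
```

With this split every node on a path contributes its execution time exactly once: intermediate nodes and the first node through the edge leaving them, and the last node through the final correction.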

5.5.  Clustering  Technique 

Clustering refers to the grouping of nodes together, and nodes clustered
together will be assigned to the same processor so that the communication
delay among the nodes is eliminated. Which nodes should be clustered together
is of primary importance. Recall that the main objective is the reduction of the
completion time of the LU task graph. Hence the effect of clustering nodes
together should accomplish this end. As pointed out in the earlier chapters, the
length of the critical path has a close correlation with the completion time of
the graph. Actually the length of the critical path without adding the delay along
the path is the minimum time required to finish that task graph no matter how
many processors are available. This is the best we can achieve. Under the worst
case assumption where each edge has a delay associated with it, the sensible
approach is to shorten the critical path by clustering nodes on the critical
path.

5.5.1.  A  Modified  Heuristic  Approach 

This problem of reducing the communication overhead has been addressed
by Jen[28] and Efe[29]. In this heuristic technique, a few modifications are made
to their heuristics for arbitrary task graphs in order to obtain a more optimistic
approach. The definitions pertaining to the description of the heuristic are
given below:

Let G = (N, E, <) be the original directed graph with N nodes and a set of edges
E.

(1) (i, j) = a directed edge from node i to node j.

(2) t(i) = execution time for node i.

(3) d(i, j) = communication delay from node i to node j.

Let $N_c$ be a set of nodes clustered into a big node C. The reduced graph
$G_r = (N_r, E_r, <)$ of G is defined as

(1) $N_r = N - N_c + C$;

(2) For all i in $N_c$, j in $N - N_c$: (c, j) = (i, j); (j, c) = (j, i);

(3) $t(c) = \sum t(i)$ for all i in C;

(4) d(c, j) = Max d(i, j) for all i in C;

(5) d(j, c) = Max d(j, i) for all i in C;
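Rules (3) to (5) of the reduced graph can be illustrated with a short C sketch. Everything named here is an assumption made for the illustration: the `ClusterNode` type, the `EMPTY` marker for a missing edge, the 0-based node numbers and the membership flags `in_cluster`.

```c
#include <assert.h>

#define N 6
#define EMPTY (-1)   /* hypothetical marker: no edge */

typedef struct {
    int time;     /* t(c): sum of the execution times of the members */
    int out[N];   /* d(c, j) for each outside node j, EMPTY if no edge */
    int in[N];    /* d(j, c) for each outside node j, EMPTY if no edge */
} ClusterNode;

/* Collapse the nodes flagged in in_cluster into one big node C. */
ClusterNode reduce_cluster(int d[N][N], const int t[N], const int in_cluster[N])
{
    ClusterNode c;
    int members = 0;

    c.time = 0;
    for (int j = 0; j < N; j++) {
        c.out[j] = EMPTY;
        c.in[j] = EMPTY;
    }
    for (int i = 0; i < N; i++) {
        if (!in_cluster[i])
            continue;
        members++;
        c.time += t[i];                      /* rule (3) */
        for (int j = 0; j < N; j++) {
            if (in_cluster[j])
                continue;                    /* delays inside C disappear */
            if (d[i][j] > c.out[j])
                c.out[j] = d[i][j];          /* rule (4): Max d(i, j) */
            if (d[j][i] > c.in[j])
                c.in[j] = d[j][i];           /* rule (5): Max d(j, i) */
        }
    }
    assert(members > 0);
    return c;
}
```

Because `EMPTY` is negative and delays are non-negative, the plain comparison keeps an entry empty only when no member of the cluster has an edge to or from node j.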

Consider an example. Figure 5.1(a) is the given precedence graph with six
nodes. The communication delay between a pair of nodes is as shown. Suppose
we decide to cluster nodes 2, 3 and 4 together in order to eliminate the delay
between these nodes. The resultant reduced graph is shown in Figure 5.1(b).

The  reduced  graph  has  four  nodes  and  the  clustered  node  C  consists  of 
nodes 2, 3 and 4 of the original graph. The delays $d_{23}$ and $d_{34}$ are set to zero and
the other delays are shown in Figure 5.1(b). The execution time of the clustered
node C is equal to the sum of the execution times of the nodes 2, 3 and 4. We
assume that the nodes in the clustered node are executed sequentially by one
processor. However a further assumption in [28] is that data needed by other
nodes are not transmitted until the end of the execution of all the nodes in that
cluster. This will certainly delay the completion time of the task graph. In the
previous example, nodes 2, 3 and 4 are executed before the transmission of the
data to nodes 5 and 6. In our heuristic approach, once a node in the cluster is
executed and if the result is needed by some other nodes, the result is
transmitted immediately. In the example, results from nodes 2 and 3 will be sent
to node 5 without waiting for node 4 to finish its execution. Hence the execution
of node 5 will not be delayed.

Another feature of the heuristic is the load-balancing constraint. The
load-balancing constraint limits the maximum number of nodes a cluster can have.
This reduces the possibility of overloading the processors when scheduling is
performed on the reduced task graph. The reason to implement the load-balancing
constraint is that it is not uncommon to find a large number of nodes clustered
together in some clusters while there are relatively few nodes in other
clusters. This loads the processors unevenly: some processors may be assigned to
execute a large number of nodes compared to other processors. The result is an
unbalanced allocation of resources, and it might also produce an undesirable
schedule when a scheduling algorithm is applied to the reduced graph. The
load-balancing constraint has the disadvantage of not producing the shortest
critical path in the heuristic sense. The reason is that nodes that must be
clustered together in order to give a shorter critical path are forced to break
apart to form more clusters because of the constraint on the maximum number of
nodes a cluster can have. Hence the communication delay will remain between
these clusters, thus lengthening the critical path.

Simulation programs were written and run on several examples and the
results are summarized in the next few sections. A structured, high-level
description of the simulation programs is given in the following section.
5.6. High Level Description of the Heuristic Clustering Technique

After describing the finding of the critical path and the clustering
technique, we are ready to give a more detailed description of the heuristic
global clustering algorithm. Let us adopt the following notation:

G : precedence graph
L : critical path of G
G' : new precedence graph
L' : critical path of G'

Nc(i) : number of nodes in cluster i

Max_nd : maximum number of nodes a cluster can have

ALGORITHM

Initialization:

    Given an initial precedence graph G. Specify Max_nd and initialize the
    length of L' = ∞.

for ( ; ; ) /* forever loop */
{
    (1) Find the critical path L in G.

        If (the length of L' < length of L) exit
        Else continue

    (2) Unmark all the edges on L.

    (3) Find node i and node j on L such that edge(i, j) is unmarked.

        If (no such edge exists) exit
        Else continue

        If (neither node i nor node j is in a cluster)
            create a new cluster containing these two nodes. Set the delay
            d(i, j) to zero.

        Else if (either node i or node j is in an existing cluster k)
        {
            Nc(k) = Nc(k) + 1;

            If (Nc(k) > Max_nd)
                mark edge(i, j), go to (3).
            Else
                group the node (i or j) not in the cluster into the cluster k
                and set all the delays between that node and the nodes in
                cluster k to zero, i.e.

                d(i, c) = 0; for all c in cluster k if node j is in cluster k,
                d(j, c) = 0; for all c in cluster k if node i is in cluster k.
        } /* end of Else if */

        Else if (node i is in cluster p and node j is in cluster q)
        {
            If (Nc(p) + Nc(q) > Max_nd)
                mark edge(i, j), go to (3).
            Else
                group all the nodes in clusters p and q together and set all
                the delays between the nodes in clusters p and q to zero, i.e.

                d(i, c) = 0; for all c in cluster q
                d(j, d) = 0; for all d in cluster p
        } /* end of Else if */

    (4) Call the reduced graph (graph after clustering) G', and find its
        critical path L'. Set G = G', go to (1).

} /* end of forever loop */
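The case analysis of step (3) can be sketched in C as below. This is a simplified illustration under assumptions rather than the simulation program itself: cluster membership is kept in a plain array `cluster_of` instead of a linked list, the size limit is tested before a node is added (so a rejected edge leaves the counts untouched), an edge whose end nodes already share a cluster is treated as clustered, and a negative matrix entry means "no edge". All names are hypothetical.

```c
#include <assert.h>

#define N 8          /* number of nodes in this illustration */
#define NONE (-1)    /* node is not in any cluster yet */

/* Try to put the end nodes of edge (i, j) into one cluster.
 * cluster_of[v] is the cluster of node v or NONE; size[c] plays the role of
 * Nc(c). Returns 1 if i and j now share a cluster, 0 if the limit max_nd
 * forbids it, in which case the caller marks the edge. */
int cluster_edge(int cluster_of[N], int size[N], int *next_id,
                 int i, int j, int max_nd)
{
    assert(i >= 0 && i < N && j >= 0 && j < N && i != j);
    int p = cluster_of[i];
    int q = cluster_of[j];

    if (p == NONE && q == NONE) {          /* create a new two-node cluster */
        int c = (*next_id)++;
        cluster_of[i] = cluster_of[j] = c;
        size[c] = 2;
        return 1;
    }
    if (p == q)                            /* already in the same cluster */
        return 1;
    if (p == NONE || q == NONE) {          /* one node joins cluster k */
        int k = (p == NONE) ? q : p;
        if (size[k] + 1 > max_nd)
            return 0;
        cluster_of[i] = cluster_of[j] = k;
        size[k]++;
        return 1;
    }
    if (size[p] + size[q] > max_nd)        /* merging p and q is too large */
        return 0;
    for (int v = 0; v < N; v++)            /* merge cluster q into cluster p */
        if (cluster_of[v] == q)
            cluster_of[v] = p;
    size[p] += size[q];
    size[q] = 0;
    return 1;
}

/* Set the delay to zero on every existing edge inside a cluster. */
void zero_cluster_delays(int d[N][N], const int cluster_of[N])
{
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++)
            if (cluster_of[u] != NONE && cluster_of[u] == cluster_of[v] && d[u][v] > 0)
                d[u][v] = 0;
}
```

After a successful call, `zero_cluster_delays` applies the "set the delays to zero" part of the step, and the critical path of the reduced graph can be recomputed as in step (4).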

5.6.1. Data Structure for Managing the Clusters

Note that in this heuristic algorithm, keeping the records of which nodes
are in which clusters is very important. This determines which delay on an edge
will be set to zero. In this section, a data structure for manipulating these
clusters will be described. This data structure will have the flexibility of adding
new clusters as they are formed, keeping track of what nodes belong to which
cluster, inserting new nodes into the existing clusters and grouping clusters
together. A uni-directional linked list is used for the above purpose. The
structure has three fields defined as

    struct node_cluster
    {   int node_no;
        int cluster;
        struct node_cluster *pt;
    };

The first field (int node_no) identifies the node number of the node. The
second field (int cluster) tells which cluster that particular node belongs to.
The third field (struct node_cluster *pt) is a pointer pointing to the next
structure of the same kind. An example is shown in Figure 5.2. Nodes 3 and 10 are
in cluster 2 and nodes 12 and 14 are in cluster 3. Suppose in the clustering
process, we find that node 3 and node 7 should be clustered together. By
traversing the existing linked list, we discover that node 3 is in cluster 2. Hence
we add the new node 7 to the linked list as shown in Figure 5.2(b). In this
example, cluster 2 contains nodes 3, 7 and 10. A new cluster which is not in the
existing linked list can be inserted similarly. Note that nodes are inserted in
such a way that they are arranged in ascending order according to their node
numbers for easy searching. Once a pair of nodes for clustering is identified, the
delays on those edges that need to be eliminated are found according to step (3)
in the algorithm described in the previous section. We can go directly to the
distance matrix associated with the graph and set the values of the appropriate
entries to zero.
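The operations described for this list (looking up a node, inserting a node in ascending order and grouping two clusters) can be sketched in C as follows. The struct follows the three fields described in the text; the function names `find_cluster`, `insert_node` and `merge_clusters`, and the use of -1 for "not found", are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct node_cluster {
    int node_no;               /* node number */
    int cluster;               /* cluster the node belongs to */
    struct node_cluster *pt;   /* next record in the list */
};

/* Return the cluster of node_no, or -1 if the node is in no cluster.
 * The search stops early because the list is in ascending node order. */
int find_cluster(const struct node_cluster *head, int node_no)
{
    for (; head != NULL && head->node_no <= node_no; head = head->pt)
        if (head->node_no == node_no)
            return head->cluster;
    return -1;
}

/* Insert node_no with its cluster, keeping ascending order.
 * Returns the (possibly new) head of the list. */
struct node_cluster *insert_node(struct node_cluster *head, int node_no, int cluster)
{
    struct node_cluster **link = &head;

    assert(cluster >= 0);
    while (*link != NULL && (*link)->node_no < node_no)
        link = &(*link)->pt;
    if (*link != NULL && (*link)->node_no == node_no) {
        (*link)->cluster = cluster;   /* node already recorded: update it */
        return head;
    }
    struct node_cluster *n = malloc(sizeof *n);
    if (n == NULL)
        return head;                  /* out of memory: list left unchanged */
    n->node_no = node_no;
    n->cluster = cluster;
    n->pt = *link;
    *link = n;
    return head;
}

/* Group two clusters together by relabelling every member of `from`. */
void merge_clusters(struct node_cluster *head, int from, int into)
{
    for (; head != NULL; head = head->pt)
        if (head->cluster == from)
            head->cluster = into;
}
```

For the situation of Figure 5.2, inserting nodes 3, 10, 12 and 14 and then calling `insert_node(head, 7, 2)` places node 7 between nodes 3 and 10.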
5.7. Complexity of the Heuristic Clustering Algorithm

The dominant cost in the heuristic clustering algorithm is the finding of the
critical path in the directed graph. As mentioned in section 5.2.2, the Cascade
Algorithm requires $2N^3$ operations for an N by N full distance matrix
(corresponding to a fully connected directed graph with N nodes),
where one operation is defined as the addition of two elements in the distance
matrix and the comparison to obtain the maximum. The other steps, like accessing
the data structures of the clusters, the distance matrix and the routing
matrix, take substantially less time compared to the Cascade Algorithm. For
example, finding whether a node is in one of the clusters requires at most N
comparisons when traversing the linked list of the clusters. Since we have to
repeat the finding of the critical path and the accessing of the necessary data
structures iteratively until the shortest critical path is attained, the worst
case takes $N \times 2N^3$ operations by going through all the nodes in the graph.
Hence the complexity of the heuristic is at most $O(N^4)$.

However in most of the applications, the directed graph is not fully
connected. This is the case for the LU task graph, since in the Doolittle Algorithm
there are no loops in the directed graph. Furthermore, the two operations,
update and divide, have to receive at most three values in order to carry out the
calculation. This implies that the in-degree (number of edges going into a node)
of a node (an operation) is at most three. Hence most of the distance matrices
associated with an LU task graph are extremely sparse. With the linked list data
structure for the distance matrix, the complexity of the heuristic clustering
algorithm is expected to be much less than $O(N^4)$, where N is the number of
nodes in the LU task graph.



5.8. Simulation Results

In this section, several directed graphs with different numbers of nodes are
tested on this heuristic clustering algorithm. One graph is generated from the
circuit matrix of a bench-mark circuit Ln [1]. The others are generated
randomly. These examples are run on a VAX 11/780 computer. Three different
simulation results were obtained. The most important one shows the effectiveness
of reducing the length of the critical path of a given graph as a function of the
constraint on the maximum number of nodes allowed in one cluster. The second
simulation result shows the reduction of the number of nodes of the reduced
graph as a function of the constraint on the maximum number of nodes in a
cluster. The last simulation results compare the performance of this heuristic
clustering technique with those local heuristic algorithms described in chapter
four. The performance is measured based on the speedup ratio obtained on
these examples under the assumption that enough processors are available to
achieve the maximum speedup ratio. This speedup performance is plotted as a
function of delay for all the heuristic algorithms. For the previous two
simulations, delay is a parameter in the curves. Each of the above three
simulations is described in more detail in the following sub-sections.
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5.8.1.  Reduction  of  the  Length  of  the  Critical  Path 

The  performance  is  measured  as  the  percentage  of  reduction  on  the  origi¬ 
nal  length  of  the  critical  path  of  a  given  directed  graph.  The  percentage  of 
reduction  is  defined  as 

%  of  reduction  =  (g™P* )  -  dust  Jen  ngh^100 

origjen  ( graph ) 

where  arrigjen (graph)  =  Length  of  the  original  graph,  clust Jen  (graph.)  - 
Length  of  the  final  reduced  graph.  Two  sets  of  results  are  generated  with  delay 
as  a  parameter.  The  first  set  of  results  shows  the  situation  where  data  needed  by 
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other  processors  is  transmitted  once  it  is  generated  in  a  clustered  node.  The 
results  are  shown  from  Figure  5.3(a)  to  Figure  5.3(c).  The  second  set  of  results 
assumes  the  case  where  all  the  nodes  in  a  cluster  are  executed  before  transmit¬ 
ting  the  data  needed  by  other  processors.  It  is  expected  that  the  first  case  gives 
a  better  performance  than  the  second  case  as  we  have  already  addressed  the 
above  problem  in  the  section  describing  the  heuristic  algorithm.  The  second  set 
is  shown  from  Figure  5.4(a)  to  Figure  5.4(c).  These  show  that  as  the  number  of 
nodes  per  cluster  increases,  the  percentage  of  reduction  increases  until  a  cer¬ 
tain  point  is  reached  where  further  clustering  is  not  necessary.  In  all  cases,  as 
seen  from  the  simulation  results,  the  clustering  technique  becomes  more 
effective  as  the  delay  in  the  graph  increases. 
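
The percentage of reduction defined in this subsection can be illustrated with a minimal sketch. It assumes unit execution times, a fixed delay on every edge that crosses between clusters, zero delay inside a cluster, and enough processors. The function names, the five-node graph, and the cluster labels are hypothetical and are not taken from the simulations reported here; the sketch also ignores the serialization of nodes in a cluster that do not lie on a common path.

```python
from functools import lru_cache

def critical_path_length(succ, delay, cluster_of, exec_time=1):
    """Longest path through a DAG. An edge costs `delay` only when its
    two end nodes belong to different clusters (assumed model)."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        best = 0
        for nxt in succ[node]:
            hop = 0 if cluster_of[node] == cluster_of[nxt] else delay
            best = max(best, hop + longest_from(nxt))
        return exec_time + best
    return max(longest_from(n) for n in succ)

def percent_reduction(orig_len, clust_len):
    """Percentage of reduction on the length of the critical path."""
    return 100.0 * (orig_len - clust_len) / orig_len

# Hypothetical graph: 1 -> 2 -> 3 -> 4 and 1 -> 5 -> 4.
succ = {1: (2, 5), 2: (3,), 3: (4,), 4: (), 5: (4,)}
unclustered = {n: n for n in succ}                 # every node alone
clustered = {1: "A", 2: "A", 3: "A", 4: 4, 5: 5}   # nodes 1, 2, 3 merged

orig_len = critical_path_length(succ, delay=2, cluster_of=unclustered)
clust_len = critical_path_length(succ, delay=2, cluster_of=clustered)
reduction = percent_reduction(orig_len, clust_len)
```

With a delay of 2 the unclustered critical path 1-2-3-4 has length 10; after clustering, the path 1-5-4 of length 7 becomes critical, giving a 30% reduction.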

5.8.2. Reduction on the Number of Nodes

When  the  clustering  technique  is  applied  to  a  graph,  nodes  are  grouped  into 
clusters.  By  the  definition  of  the  reduced  graph,  nodes  in  a  cluster  are  executed 
in  sequence  and  the  cluster  is  considered  as  one  node  during  the  next  iteration 
of the algorithm. Hence the reduced graph will have fewer nodes than the
original graph. This will have a desirable effect on the computational time
spent in the assignment of the nodes to processors. Recall that
most  of  the  feasible  scheduling  algorithms  have  polynomial  running  time  in  N 
where  N  is  the  number  of  nodes  to  be  scheduled  in  a  directed  graph.  Hence 
clustering  nodes  together  will  reduce  the  time  in  scheduling  the  nodes  in  the 
resultant  reduced  graph. 

The  performance  of  the  heuristic  in  this  sense  is  measured  as  the  percen¬ 
tage  reduction  of  the  number  of  nodes.  It  is  defined  as 

\[
\%\ \text{of reduction} = \frac{\mathit{orig\_num}(\mathit{graph}) - \mathit{clust\_num}(\mathit{graph})}{\mathit{orig\_num}(\mathit{graph})} \times 100
\]

where orig_num(graph) = number of nodes in the original graph and
clust_num(graph) = number of nodes in the reduced graph. The results are
shown  from  Figure  5.5(a)  to  Figure  5.5(c)  where  the  two  curves  correspond  to 
two  cases  mentioned  in  the  subsection  5.4.1.  The  curve  B  corresponds  to  the 
optimistic  case  where  data  is  transmitted  immediately  once  the  node  is  exe¬ 
cuted  in  the  cluster.  The  curve  A  corresponds  to  the  pessimistic  case  where  data 
is  transmitted  only  after  all  the  nodes  in  the  cluster  are  executed.  The  same 
conclusion  is  observed  as  in  the  reduction  of  the  length  of  the  critical  path.  The 
percentage  reduction  increases  as  the  constraint  on  the  number  of  nodes  in  the 
cluster  increases  until  it  reaches  a  point  where  further  clustering  will  not 
reduce  the  number  of  nodes  in  the  graph. 
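
As a rough illustration of how the reduced graph and its node count could be obtained, the following sketch contracts every cluster into a single node. The helper names, the five-node graph, and the cluster label "C" are hypothetical; the data structures actually used for clusters are those described earlier in the chapter, not the dictionaries assumed here.

```python
def reduce_graph(succ, cluster_of):
    """Contract each cluster to one node and return the successor sets
    of the reduced graph (edges inside a cluster disappear)."""
    reduced = {c: set() for c in cluster_of.values()}
    for node, nexts in succ.items():
        for nxt in nexts:
            a, b = cluster_of[node], cluster_of[nxt]
            if a != b:
                reduced[a].add(b)
    return reduced

def percent_node_reduction(orig_num, clust_num):
    """Percentage of reduction on the number of nodes."""
    return 100.0 * (orig_num - clust_num) / orig_num

# Hypothetical graph: 1 -> 2 -> 3 -> 4 and 1 -> 5 -> 4; nodes 2 and 3 merged.
succ = {1: (2, 5), 2: (3,), 3: (4,), 4: (), 5: (4,)}
cluster_of = {1: 1, 2: "C", 3: "C", 4: 4, 5: 5}

reduced = reduce_graph(succ, cluster_of)
node_reduction = percent_node_reduction(len(succ), len(reduced))
```

Here five nodes become four, a 20% reduction, and a scheduler that is polynomial in the number of nodes has correspondingly less work to do.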

The  simulation  results  discussed  in  the  previous  two  subsections  reflect  the 
performance  of  the  global  heuristic  clustering  technique  measured  on  the 
reductions  on  the  length  of  the  critical  path  and  the  number  of  nodes  on  the 
directed  graph.  In  the  following  subsection,  this  heuristic  will  be  compared  with 
the  local  heuristics  already  described  in  chapter  four  on  how  much  speedup  can 
be  achieved  when  they  are  applied  to  schedule  nodes  on  processors. 

5.8.3. Comparison of Global and Local Heuristic Techniques

The  global  heuristic  clustering  technique  presented  in  this  chapter,  in  a 
certain  sense,  provides  an  alternative  way  of  scheduling  nodes  to  the  processors 
in the presence of communication delay. The global and local heuristic
techniques are compared on the speedup ratio achieved under the assumption
that enough processors are available to obtain the maximum speedup. Under
this assumption, the completion time of the
reduced  graph  is  the  length  of  the  critical  path  on  that  graph.  We  also  make  the 
comparison  under  the  optimistic  global  clustering  case  in  which  the  result 
needed  by  other  processors  is  transmitted  immediately  once  it  is  computed  in 
the cluster.

The speedup ratio is defined as

\[
\text{speedup ratio} = \frac{\text{completion time using a single processor}}{\text{completion time using } m \text{ processors}}
\]

where  m  is  the  number  of  processors  needed  to  achieve  the  maximum  speedup. 
The  same  four  example  graphs  are  used  for  the  purpose  of  comparison  of  the 
performance  as  a  function  of  communication  delay.  The  results  are  shown  from 
Figure  5.6(a)  to  Figure  5.6(d).  They  show  that  the  global  clustering  technique 
attains a better speedup ratio than the local heuristics, except in the example
graph with 18 nodes, where it achieves the same speedup as heuristics D, E
and F (see Figure 5.6(b)). In general, the global technique gives in the range of
50% to 100% better speedup performance than the local heuristics.
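
Under the stated assumption that enough processors are available, the speedup ratio reduces to a simple quotient, sketched below. The function name and the numbers (12 unit-time nodes, critical path lengths of 8 and 4) are hypothetical and serve only to show the arithmetic; they are not results from Figure 5.6.

```python
def speedup_ratio(num_nodes, critical_path_len, exec_time=1):
    """Speedup with enough processors: the single-processor time is the
    sum of all node execution times, and the multiprocessor completion
    time is the length of the critical path (delays included)."""
    single_processor_time = num_nodes * exec_time
    return single_processor_time / critical_path_len

# Hypothetical 12-node graph before and after clustering.
before = speedup_ratio(num_nodes=12, critical_path_len=8)
after = speedup_ratio(num_nodes=12, critical_path_len=4)
```

Halving the critical path in this made-up case doubles the speedup from 1.5 to 3.0, which is the mechanism by which clustering on the critical path improves the ratio.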


5.0.  Conclusion 

This  chapter  presents  the  basic  techniques  used  in  the  global  clustering 
algorithm.  It  performs  node  clustering  on  the  critical  path  of  a  directed  graph  to 
eliminate  the  communication  delay.  The  iterated  process  of  clustering  is  done 
on  the  whole  graph  as  compared  to  the  technique  of  minimizing  the  completion 
time  at  each  time  step  described  in  the  last  chapter.  Four  example  graphs  are 
used  to  test  out  its  performance  measured  on  the  reduction  on  the  length  of  the 
critical  path,  the  reduction  on  the  number  of  nodes  and  the  speedup  ratio 
achieved  compared  to  the  local  heuristic  techniques.  The  simulation  results 
show  that  the  global  clustering  technique  performs  better  than  the  four  local 
heuristics. 

A final remark is that the execution times of the nodes in the reduced
graph are different after the clustering technique is applied. However, the
heuristic scheduling techniques described in chapter four can still be employed
to schedule the nodes of the reduced graph to the processors. Some
modifications are necessary to the computer programs written for the local
heuristics to take into account the different execution times of the nodes when
the elapsed times of the processors are calculated. A major addition is needed
to the programs of the clustering technique so that the reduced graph is in the
correct input format to the local heuristics described in the last chapter.


An Example Of Clustering Nodes In A Directed Graph. (a) The
Original Graph (b) The Reduced Graph After Nodes 2, 3 and 4
Are Clustered.

Data Structure For Clusters. (a) Data Structure For Clusters 2
And 3 (b) Data Structure After Node 7 Is Clustered Into Cluster
2.

Figure 5.3(a) Reduction On The Length Of The Critical Path With Different
Delays. Data Is Transmitted After The Execution Of Each Node
In A Cluster. Example Graph With 18 Nodes

Figure 5.3(b) Reduction On The Length Of The Critical Path With Different
Delays. Data Is Transmitted After The Execution Of Each Node
In A Cluster. Example Graph With 32 Nodes

Axis label: Maximum Number of Nodes Per Cluster

Figure 5.3(c) Reduction On The Length Of The Critical Path With Different
Delays. Data Is Transmitted After The Execution Of Each Node
In A Cluster. Example Graph With 58 Nodes

Figure 5.4(c) Reduction On The Length Of The Critical Path With Different
Delays. Data Is Transmitted After The Execution Of All The
Nodes In A Cluster. Example Graph With 58 Nodes

Figure 5.5(a) Reduction On The Number Of Nodes. Curve A Corresponds To
The Pessimistic Case. Curve B Corresponds To The Optimistic
Case. Example Graph With 18 Nodes

Figure 5.5(b) Reduction On The Number Of Nodes. Curve A Corresponds To
The Pessimistic Case. Curve B Corresponds To The Optimistic
Case. Example Graph With 32 Nodes

Figure 5.6(d) Comparison Of Global And Local Heuristic Techniques. Example
Graph With 56 Nodes.


CHAPTER  6 



CONCLUSION 

6.1. Overall Review of the Problem

In this dissertation the problem of decomposing a nonsingular, unstructured
sparse matrix into a product of a lower triangular matrix and an upper
triangular matrix is presented. The algorithm employed is the Doolittle method.
The factorization is represented by a sequence of basic operations in the
Doolittle algorithm. These basic operations are the divide operation,
$a_{ik} = a_{ik} / a_{kk}$, and the update operation,
$a_{ij} = a_{ij} - a_{ik} \cdot a_{kj}$. Because of the order in which
these operations must be executed, the procedure for the LU factorization can
be modeled by a directed graph. Each node in the graph represents either a
divide or an update operation. Using these precedence constraints for the
operations, inherent parallelism can be detected. Hu's level scheduling
algorithm is used to obtain a deterministic schedule in which nodes are
assigned to different processors for concurrent execution. The speedup
performance, which is defined as the ratio of the completion time using one
processor to the completion time using more than one processor, is very
satisfactory.
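
To make the divide and update operations concrete, the following sketch factors a small matrix in place and records each operation as it is performed. It is an illustration under assumptions, not the program used in this work: it orders the operations as in right-looking Gaussian elimination without pivoting, which produces the same unit-lower-triangular L and upper-triangular U as the Doolittle method, and the function name, tuple format, and 3 x 3 matrix are hypothetical.

```python
def lu_in_place(a):
    """Overwrite `a` with L (below the diagonal, unit diagonal implied)
    and U (on and above the diagonal); return the list of operations.
    Operations on zero entries are skipped to mimic a sparse matrix."""
    n = len(a)
    ops = []
    for k in range(n):
        for i in range(k + 1, n):
            if a[i][k] != 0:
                a[i][k] /= a[k][k]                 # divide operation
                ops.append(("divide", i, k))
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                if a[i][k] != 0 and a[k][j] != 0:
                    a[i][j] -= a[i][k] * a[k][j]   # update operation
                    ops.append(("update", i, j, k))
    return ops

# Hypothetical sparse 3 x 3 example (indices start at 0).
a = [[2.0, 0.0, 1.0],
     [4.0, 1.0, 0.0],
     [0.0, 3.0, 5.0]]
ops = lu_in_place(a)
```

Each recorded tuple would become one node of the directed graph. In this example the update of entry (2, 2) needs the results of the update of entry (1, 2) and of the divide of entry (2, 1), which is the kind of precedence constraint the graph captures.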

However, when there is a communication delay in transmitting from the
sending processor to the destination processor, results have shown that the
speedup performance based on the schedules obtained from scheduling
algorithms without communication delay consideration degrades substantially.
Hence other scheduling techniques with communication delay consideration
should be developed. These scheduling problems are inherently intractable and
heuristic scheduling algorithms provide a plausible approach.

In these heuristic techniques, combinatorial optimization algorithms such
as min-max matching, weighted matching and heuristic clustering techniques
are employed. In most of the cases, these heuristic scheduling techniques do
produce schedules which have shorter completion time than the schedules
obtained from Hu's level scheduling algorithm.
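
For reference, a level scheduling rule in the spirit of Hu's algorithm can be sketched as follows. This is an assumed, simplified version with unit execution times and no communication delay: the level of a node is the number of nodes on the longest path from it to an exit node, and at each time step the ready nodes with the highest levels are assigned first. The function name and the example graph are hypothetical.

```python
def hu_level_schedule(succ, num_procs):
    """Return a list of time steps, each a list of nodes run concurrently."""
    level = {}

    def compute_level(node):
        if node not in level:
            level[node] = 1 + max((compute_level(s) for s in succ[node]), default=0)
        return level[node]

    for node in succ:
        compute_level(node)

    # Count unfinished predecessors of every node.
    pending = {node: 0 for node in succ}
    for node in succ:
        for s in succ[node]:
            pending[s] += 1

    ready = [node for node in succ if pending[node] == 0]
    schedule = []
    while ready:
        ready.sort(key=lambda node: -level[node])     # highest level first
        step, ready = ready[:num_procs], ready[num_procs:]
        schedule.append(step)
        for node in step:                             # release successors
            for s in succ[node]:
                pending[s] -= 1
                if pending[s] == 0:
                    ready.append(s)
    return schedule

# Hypothetical graph: 1 -> 2 -> 3 -> 4 and 1 -> 5 -> 4, two processors.
succ = {1: (2, 5), 2: (3,), 3: (4,), 4: (), 5: (4,)}
schedule = hu_level_schedule(succ, num_procs=2)
```

The five nodes finish in four time steps, so the speedup without delay is 5/4. Because the rule never looks at which processor holds a predecessor's result, its schedules can degrade once a delay is charged for every interprocessor transfer.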

6.2.  Significance  of  this  Research 

The mapping of an algorithm to a directed graph model representation can
be applied to many algorithms in other areas such as digital signal processing.
The parallelism in the graph can be detected. The nodes in the graph do not
necessarily represent one arithmetic expression; they can be segments of code
or modules such as do loops in a computer program. In a realistic distributed
processing environment, the interprocessor communication overhead is
unavoidable. In many of the application algorithms, the graph model
representation or the data dependency graph does not have special structure.
In order to fully exploit parallel computing, scheduling methods should be
constructed to minimize the effects of the communication overhead. This
dissertation attempts to develop these techniques heuristically. Satisfactory
results based on the simulation are obtained.

6.3. Future Related Research

In this section, some of the possible future related research areas are
discussed. Here the execution times for the nodes in the task graph have been
assumed to be the same. However, the heuristic techniques described in
chapters four and five can be extended to the case of unequal node execution
times. In more practical situations, the nodes or in general the modules do not
have equal processing times. Since the heuristic techniques developed also
consider the node execution time when assigning nodes to processors, it is
expected that promising results will be obtained.


When  the  interprocessor  communication  is  ignored,  bounds  on  the  max¬ 
imum  and  minimum  number  of  processors  required  to  finish  the  task  graph  in 
the  shortest  time  can  be  obtained  quite  easily  given  the  structure  of  the  task 
graph.  It  is  observed  in  the  simulation  results  presented  in  chapter  four  that  in 
the  case  where  communication  delay  is  taken  into  account,  using  more  proces¬ 
sors  does  not  necessarily  improve  the  speedup  performance.  In  fact  the 
increase  in  the  number  of  processors  reduces  the  speedup  ratio  compared  to 
using a small number of processors. This is perhaps due to the fact that a large
number of processors will increase the possibility of assigning nodes and their
predecessors  to  different  processors.  Because  of  the  delay  between  the  com¬ 
municating  processors,  it  will  increase  the  completion  time  of  the  task  graph.  In 
this  case,  the  bounds  are  necessary  to  obtain  the  appropriate  number  of  proces¬ 
sors  in  order  to  achieve  the  maximum  possible  speedup. 

In  the  interconnecting  topology  of  the  processors,  a  fully  connected  switch 
is  assumed.  Hence  any  processor  can  communicate  with  all  other  processors. 
Given a task graph, a fixed communication pattern can be formed. This means
that some message traffic statistics between each pair of processors are observed.
Based  on  this  traffic  pattern,  a  topology  of  interconnection  of  processors  can  be 
designed  to  achieve  an  even  more  efficient  execution  of  an  algorithm.  This  is 
particularly  useful  for  algorithms  which  have  a  fixed  communication  pattern 
between  processors.  With  a  topology  tailored  to  an  algorithm,  optimization  in  the 
area  of  efficient  message  routing  can  also  enhance  the  execution  speed  of  the 
algorithm.  With  the  knowledge  of  the  traffic  pattern,  intelligent  routing  schemes 
should  be  employed  for  efficient  exchange  of  data  enabling  the  processors  to 
acquire  the  necessary  data  faster.  This  will  increase  the  speedup  of  the  algo¬ 
rithm. 

All these thoughts aim at the goal of improving the overall efficiency of dis -

ABSTRACT

This dissertation reports on a study of algorithms for the realization of
digital infinite impulse response (IIR) filters using multiple programmable
elements (PE's). The motivation is to increase the sampling rate which can be
achieved
with  a  VLSI  implementation  of  a  digital  filter.  This  is  best  achieved  utilizing 
parallelism  since  VLSI  will  provide  dramatic  increases  in  hardware  complexity 
with  much  more  modest  increases  in  speed. 

The algorithms presented can be classified into two categories, single-input
single-output (SISO) and multi-input multi-output (MIMO) filters. Included in the
SISO category are the well known cascade and parallel forms, systolic arrays,
and Barnwell's algorithm. The MIMO filter is also called a block filter since the
input  samples  are  divided  into  blocks  and  processed  by  vector  and  matrix 
operations. 

The theory, as confirmed by simulation results, indicates that block filters can
achieve a much higher sampling rate than SISO filters, at the expense of a larger
amount of hardware. In fact, it is shown that the sampling rate achieved can
become arbitrarily large as hardware is added, so that the die area and
computational resources on a chip become the sole limitation on sampling rate,
not the speed of the hardware. Block filters achieve their increase in speed by
the addition of PE's, as well as by increases in the speed of the PE's, and are
therefore well suited to the VLSI technology.

It  is  further  shown  that  a  two  dimensional  systolic  array  of  PE's  can  realize 
the  block  state  structure  for  an  IIR  digital  filter,  making  it  possible  to  achieve 
the  aforementioned  advantages  with  only  local  interconnections  of  PE's.  This 
meets  another  important  constraint  of  VLSI,  the  minimization  of  expensive  glo¬ 
bal  communications. 

IIR filters are known to require a lower computational rate as compared to
finite impulse response (FIR) filters with approximately the same transfer
function. However, block state IIR filters lose their superiority over FIR filters at
extremely  high  sampling  rates.  An  example  shows  that  when  the  block  size 
exceeds  56  for  an  elliptic  filter,  the  FIR  filter  is  more  advantageous  in  realizing  a 
similar  response. 

Block  filters  are  also  shown  to  have  excellent  properties  as  far  as  roundoff 
noise  is  concerned.  Although  the  average  computation  rate  is  increased,  the 
average  output  roundoff  noise  decreases  when  a  single  rounding  is  performed  at 
each  internal  summing  node. 

Finally,  the  various  filter  structures  are  compared  with  respect  to  their 
roundoff  noise  susceptibility,  susceptibility  to  delay  in  the  PE  interconnection 
paths, complexity of PE interconnection, etc.

CHAPTER 1


INTRODUCTION 


1.1. Impact of VLSI on Signal Processing

As  the  current  semiconductor  technology  advances,  more  and  more  transis¬ 
tors  can  be  put  into  a  single  silicon  wafer.  The  trend  of  this  technology  develop¬ 
ment  is  moving  from  Large  Scale  Integration  (LSI)  toward  Very  Large  Scale 
Integration  (VLSI).  This  technology  improvement  will  have  a  great  impact  on 
many fields, such as circuit simulation, computer architecture and signal
processing. Since the computational power of a chip is directly proportional to
the  overall  number  of  transistors  on  a  chip,  the  VLSI  circuit  can  provide  a  much 
more  powerful  computational  capability.  This  higher  computational  ability  will 
have  a  great  impact  on  the  signal  processing  systems  since  it  would  allow  us  to 
implement  much  more  sophisticated  algorithms  or  to  process  signals  at  a  much 
higher  sampling  rate.  However,  this  high  speed  signal  processing  cannot  be 
obtained  by  implementing  existing  algorithms  directly  in  VLSI,  since  the  speed 
of  VLSI  will  not  increase  as  rapidly  as  the  complexity.  Hence,  the  first  implica¬ 
tion  of  VLSI  is  that  in  order  to  effectively  utilize  the  high  density  chips,  parallel 
algorithms  should  be  employed. 

VLSI is achieved by scaling down the size of the devices, with an attendant
reduction in power supply voltage to maintain reliability. The lower supply
voltage and smaller devices severely restrict the capability to send data over
long wires. Hence, although computation and control functions are relatively
plentiful in VLSI, communication is expensive. Communication here means data
transmission between different parts of the same chip as well as between chips.
Together with the parallel processing requirement, this implies that the most
desirable algorithms for achieving maximum performance in a VLSI system are
those which achieve increased parallelism with a minimum of communication
among the parts of the algorithm. This is why structures like systolic arrays
are attractive in VLSI applications.

Since  digital  circuitry  suffers  less  from  the  scaling  than  analog  circuitry 
and  digital  processing  of  signals  can  usually  realize  more  sophisticated  algo¬ 
rithms  than  analog  processing,  the  trend  of  signal  processing  systems  is  moving 
toward digital processing. This dissertation deals with new algorithms in
digital signal processing systems, specifically for the implementation of digital
IIR filters.

With  conventional  digital  signal  processing  algorithms,  the  throughput  rate 
depends  heavily  on  the  chip  speed,  since  the  input  signal  is  usually  processed  in 
serial.  In  order  to  achieve  a  higher  throughput  rate,  some  modification  of  the 
existing algorithms has to be made to exploit this high chip density. The above
characterization of VLSI suggests that parallel or pipelined structures
would be very efficient in processing high speed signals using relatively low
speed hardware.

The Fast Fourier Transform (FFT) is a good example to show the importance
of modifying the current algorithms and the tradeoff between the overall speed
and the hardware requirement. The FFT has been considered as a very efficient
algorithm  to  compute  the  Discrete  Fourier  Transform  (DFT)  because  it  requires 
much  less  computation.  However,  the  complex  data  flow  pattern  in  the  FFT[1] 
prohibits  us  from  utilizing  the  parallel  processing  technique,  since  the  mutually 
dependent  data  manipulations  make  it  difficult  for  a  VLSI  design.  Although  FFT 
has  a  pipelining  structure  in  nature,  the  butterfly  connection  among  cascading 
stages requires a lot of wires for interconnection. A better algorithm to
maximize the throughput would be to compute the DFT directly as a matrix-vector
multiplication. Many structures, which use extensive pipelining and require
simple interconnections among processing elements, are known to be very
efficient in performing this multiplication.
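
The direct computation of the DFT as a matrix-vector multiplication can be sketched in a few lines. The function names are hypothetical and the code is only an illustration of the data flow, not one of the structures referred to above: every output sample is an independent inner product of one matrix row with the input vector, so the rows could in principle be assigned to separate processing elements.

```python
import cmath

def dft_matrix(n):
    """n x n DFT matrix with entries W^(k*m), W = exp(-j*2*pi/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * m) for m in range(n)] for k in range(n)]

def dft(x):
    """DFT of x computed directly as a matrix-vector product."""
    rows = dft_matrix(len(x))
    # One inner product per output sample; no row depends on another.
    return [sum(r * v for r, v in zip(row, x)) for row in rows]

impulse_spectrum = dft([1, 0, 0, 0])    # flat spectrum
constant_spectrum = dft([1, 1, 1, 1])   # energy only at frequency 0
```

This form needs on the order of N squared multiplications instead of the N log N of the FFT, which is the tradeoff between operation count and regular, local data flow discussed in the text.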

The  DFT  example  suggests  that  we  reconsider  the  existing  fast  algorithms  in 
digital signal processing systems. It also implies that in order to increase the
throughput  rate,  we  should  try  to  find  algorithms  which  effectively  exploit  paral¬ 
lelism  rather  than  necessarily  reduce  the  number  of  multiplications.  Although 
most  existing  signal  processing  algorithms  are  in  a  serial  form,  the  potential  of 
parallelism  makes  high  speed  algorithms  possible  in  VLSI  circuits.  What  has  to 
be  done  is  to  find  algorithms  which  can  maximize  the  parallelism. 

1.2.  Simulation  Tools 


Simulation  programs  are  necessary  to  examine  the  performance  of  the 
parallel  algorithms  to  be  implemented.  The  simulation  is  important  because  it 
allows  us  to  avoid  building  the  hardware  and  because  simulation  can  minimize 
the hardware design time. In this dissertation, the performance is measured by
the speedup compared to the uniprocessor case. The data transmission
requirements will be determined by simulation. To satisfy this demand, two
simulation programs, both written in C, have been developed and are running on
a VAX-11/780 machine. SIMON (Simulator of Multiprocessor Networks [2]) is a
discrete-time, event-driven simulation program, which executes a set of
tasks as if they were executed simultaneously. In fact, SIMON can simulate any
program running on a multiprocessing system for which the interconnection
topology is specified. This simulator has also been successfully applied to
measuring the performance of algorithms for concurrent circuit simulation [3],
which are quite different from those in signal processing systems.

The other simulator, BLOSIM (BLOck SIMulator [4]), is a discrete-time,
time-driven simulation program. It is used to simulate sampled-data systems only
and hence is not so general as SIMON. However, it is very efficient for simulating
systems such as digital filters and other signal processing systems with a
regular sampling rate.

The structure of a general multiprocessing system can range from a
collection of general programmable processors to an interconnection of dedicated
hardware elements, such as array processors, systolic arrays, etc. A general
multiprocessing system is usually modeled as a combination of calculating and
switching processors. The switching processors are devoted to the data transfer
among computing processors only; hence, they do not contribute to the speedup
directly.  The  advantage  of  this  model  is  that  any  network  can  be  simulated  by 
inserting  a  proper  switch  network.  On  the  other  hand,  the  dedicated  structure 
can  perform  a  specific  task  in  a  more  efficient  fashion. 

1.2.1.  SIMON 

The  simulator  consists  of  three  components  (See  Figure  1-1),  the  applica¬ 
tion  program,  the  simulator  base,  and  the  switch  model.  The  application  pro¬ 
gram  consists  of  a  number  of  tasks,  or  equivalently  processes,  which  are  filter 
routines in our case. The simulator kernel time-multiplexes execution of the
tasks  on  the  host  computer.  The  kernel  also  keeps  track  of  time  for  each  task 
to  ensure  that  interactions  among  tasks  are  simulated  in  the  proper  time 
sequence.  Each  task  has  its  own  clock  which  advances  as  the  task  executes. 
Finally,  the  switch  model  provides  a  fixed  virtual  circuit  communication 
mechanism  among  the  tasks  and  simulates  message  passing  between  proces¬ 
sors. 

The  most  important  features  of  the  simulator  are  summarized  as  follows: 

(1) SIMON provides the timing statistics of each task, which include the
running time as well as the blocked time, once the processor speed is given. If
the processor speed is not specified, a VAX machine is assumed by default.
This information is useful in measuring the speedup and in giving insight into
the processor usage efficiency.

Figure 1-1 Program Structure of SIMON

(2)  SIMON  also  shows  the  average  traffic  statistics  for  each  communication  link. 
Upon  request,  it  can  even  provide  temporal  traffic  statistics.  This  informa¬ 
tion  is  very  helpful  in  designing  the  topology  and  the  switching  network, 
since  using  this  information  the  designer  can  dynamically  allocate  the 
route  for  each  message  transmission  to  avoid  the  heavy  traffic  congestion. 

(3) SIMON allows the user to specify a constant delay for each message
transmission. Along with the capability of generating the timing information,
it can effectively measure the susceptibility of each algorithm to
interprocessor communication delay. Although constant delay is not a good
assumption for the real data transfer, it does give us some insight into the
susceptibility of the throughput rate of an algorithm to the transmission
delay. A better model will be available when the design of the topology and
switch network is done.

(4)  SIMON  permits  the  user  to  create  a  timing  file  to  force  instruction  times  to 
approximate  the  actual  target  processor  on  which  the  application  program 
is  going  to  run.  Hence,  the  simulation  is  not  restricted  to  a  target  VAX 
machine  only. 

The tasks in SIMON are interconnected through first-in first-out (FIFO)
buffers. Connection of two tasks is accomplished by a naming convention: an
output FIFO will be connected to all the input FIFO's with the same name. Data
transfers between a processor and an input or an output FIFO are achieved by
calling two functions, namely get() and put(), respectively.
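
The naming convention can be pictured with a small sketch. It is not SIMON's interface: SIMON is written in C, and only the names get() and put() come from the text, while the class, the method signatures, and the FIFO name "stage1_out" are assumptions made for illustration. Timing, blocking, and message delay, which SIMON simulates, are left out.

```python
from collections import deque

class FifoNetwork:
    """Connects an output FIFO to every input FIFO of the same name."""

    def __init__(self):
        self.inputs = {}   # FIFO name -> list of reader queues

    def open_input(self, name):
        """Register a new input FIFO under `name` and return it."""
        queue = deque()
        self.inputs.setdefault(name, []).append(queue)
        return queue

    def put(self, name, value):
        """Write to the output FIFO `name`; every reader gets a copy."""
        for queue in self.inputs.get(name, []):
            queue.append(value)

def get(queue):
    """Read the oldest value from an input FIFO (non-blocking sketch)."""
    return queue.popleft()

net = FifoNetwork()
reader_a = net.open_input("stage1_out")
reader_b = net.open_input("stage1_out")
net.put("stage1_out", 10)
net.put("stage1_out", 20)
received = [get(reader_a), get(reader_b), get(reader_a)]
```

Both readers see the values in the order written, which is the first-in first-out behavior the text describes for tasks that share a FIFO name.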

1.2.2. BLOSIM

BLOSIM  is  efficient  for  simulating  systems  which  operate  on  data  at  regular 
time  intervals.  Hence,  it  can  effectively  simulate  sampled  data  systems,  such  as 
digital filters. It can even accommodate systems with different and asynchronous
sampling  rates  present  simultaneously.  The  user  partitions  the  system  into 
small  pieces  implementing  elementary  parts  of  the  system,  each  piece  called  a 
"block".  The  user  also  provides  a  program  which  defines  the  topology  of  inter¬ 
connection  of  those  blocks.  The  actual  interconnection  is  handled  by  BLOSIM  in 
a  similar  fashion  as  SIMON.  The  difference  is  that  the  switch  model  is  not 
required  and  the  FIFO's  are  usually  of  finite  length,  whereas  in  SIMON,  we  can 
assume  infinite  length  FIFO's.  If  only  the  function  of  an  algorithm  is  simulated 
and  the  completion  time  of  each  task  is  not  needed,  BLOSIM  is  more  efficient 
than  SIMON. 

1.3.  Overview 

The  signal  processing  system  can  be  implemented  on  either  off-the-shelf 
programmable chips or a dedicated hardware circuit. Each part of the circuit
can  be  a  very  complex  microprocessor  with  memory  or  a  very  simple  logic  cir¬ 
cuit  such  as  an  ALU  plus  some  registers.  In  the  later  chapters,  they  will  be 
referred  to  as  Processing  Elements  (PE’s). 

A  signal  processing  system  is  realized  on  an  interconnected  set  of  these 
PE’s,  which  can  be  structured  in  any  configuration.  The  structure  can  vary  from 
a  very  general  one,  which  is  referred  to  as  a  multiprocessor  structure,  to  a  well- 
defined structure, such as tree, ring, pipelined and parallel structures.

As mentioned before, pipelined and parallel structures are very common in
signal processing and very efficient for VLSI applications. The simple pipelined
structure is a linear array of interconnected PE's. Each PE fetches the output
from the previous one and sends its output to the next PE when it finishes its
computation. Therefore, this structure can be viewed as an array with each PE
working on successive input samples. Its structure will become clear in chapter
2 when the cascade form of a digital filter is discussed.

On the other hand, a parallel form is also composed of a set of PE's. However, all
these  PE’s  operate  on  the  same  input  sample  simultaneously.  In  the  typical 
example  of  a  parallel  digital  filter  design,  which  will  be  shown  in  the  next 
chapter,  the  output  of  each  PE  is  sent  to  a  common  place  to  be  summed  up. 

1.3.1.  Objectives 

With the above structures in mind, we will demonstrate various combinations
of algorithms and structures which can efficiently utilize the inherent
parallelism in digital signal processing systems. Specifically, several
algorithms for the parallel realization of digital IIR filters will be presented
in this dissertation. An IIR (Infinite Impulse Response) digital filter has an
impulse response of infinite length and is usually realized recursively. Digital
filters are widely used in signal processing and control systems. Furthermore,
the recursive nature of IIR filters is representative of signal processing
systems in general and presents a problem for parallel processing. Hence,
designing IIR digital filters in a parallel form is a good application for
gaining insight into the realization of parallelism in signal processing systems.

On the other hand, FIR digital filters are easy to realize in a parallel form.
Suppose a filter is realized by the direct form. This filter can be duplicated
with as many sections as necessary, with each section calculating one output
sample, as shown in Figure 1-2. Because an FIR filter has no feedback, no
communication among these sections is required.
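The duplication idea can be illustrated with a minimal Python sketch. It is not taken from the report: the function names, the test signal and the round-robin assignment of output samples to sections are assumptions made for illustration, and the sections are emulated one after another rather than run concurrently.

```python
def fir_direct(a, x):
    """Direct-form FIR: y[n] = sum_i a[i] * x[n-i], with x[k] = 0 for k < 0."""
    return [sum(a[i] * x[n - i] for i in range(len(a)) if n - i >= 0)
            for n in range(len(x))]


def fir_duplicated(a, x, sections):
    """Emulate `sections` identical FIR sections (hypothetical round-robin split).

    Section s computes output samples s, s + sections, s + 2*sections, ...
    No section ever reads a value produced by another section.
    """
    y = [0.0] * len(x)
    for s in range(sections):                 # each pass stands for one section
        for n in range(s, len(x), sections):  # the outputs owned by section s
            y[n] = sum(a[i] * x[n - i] for i in range(len(a)) if n - i >= 0)
    return y


a = [0.25, 0.5, 0.25]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(fir_duplicated(a, x, 3) == fir_direct(a, x))
```

Each section reads only input samples, which is why the split needs no communication between sections.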


Since most signal processing systems must operate in real time, it is
desirable to find algorithms for which real time filtering is achievable no
matter how fast the input sampling rate or how slow the processor. For non real
time applications, such as geophysical signal processing, a high throughput rate
is usually desired. Thus, exploiting parallelism to realize a high speed digital
filter is our main objective.

Once these parallel algorithms are available, efficient parallel structures
should be developed to realize them. Since the interconnecting wires, whether
within a chip or among chips, are expensive in VLSI design, a good structure
should have interconnections that are as localized as possible. Besides being
limited by driving capability, a global communication pattern also increases
the message transmission delay, which degrades the speed performance.

Finite word length effects are another important issue. There are two major
considerations in this area. One is the sensitivity of the filter response to
coefficient errors, which result from the finite precision of the filter
coefficients. These errors can sometimes cause a stability problem: if there is
a pole close to but still within the unit circle, the filter might become
unstable after the coefficients are quantized in the actual implementation. The
other finite precision problem is the roundoff noise and limit cycles generated
by the internal arithmetic operations. These operations usually require higher
precision for their results than for their inputs. Roundoff noise and limit
cycles occur if all values are represented with the same precision by rounding
the results. The difference between roundoff noise and limit cycles is that the
former results from uncorrelated noise sources while the latter results from
correlated noise sources. Dynamic range is another effect which is closely
related to roundoff noise. Thus, a good parallel algorithm, in addition to
providing speedup, should also improve or at least not worsen the finite word
length effects.
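The coefficient sensitivity problem can be made concrete with a small numerical sketch. It is an illustration only: the pole position, the word lengths and the rounding rule are assumed values, and the second-order denominator is taken in the form $1 - b_1 z^{-1} - b_2 z^{-2}$.

```python
import cmath
import math


def quantize(value, frac_bits):
    """Round to the nearest multiple of 2**-frac_bits (simple fixed-point rounding)."""
    step = 2.0 ** -frac_bits
    return round(value / step) * step


def pole_radii(b1, b2):
    """Magnitudes of the roots of z^2 - b1*z - b2, i.e. the poles of
    1 / (1 - b1*z^-1 - b2*z^-2)."""
    disc = cmath.sqrt(b1 * b1 + 4.0 * b2)
    return [abs((b1 + disc) / 2.0), abs((b1 - disc) / 2.0)]


# Assumed example: a complex pole pair at radius 0.998, just inside the unit circle.
r, theta = 0.998, 0.3
b1, b2 = 2.0 * r * math.cos(theta), -r * r

for bits in (12, 6):
    q1, q2 = quantize(b1, bits), quantize(b2, bits)
    print(bits, "fractional bits -> largest pole radius", max(pole_radii(q1, q2)))
```

With these particular numbers, six fractional bits round $b_2$ to $-1$, which moves the pole pair onto the unit circle, while twelve bits keep it inside.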


Before actually implementing the filter, the user has to fully understand the
range of algorithms and then choose the most appropriate one. This increases
the design time. Programming a large number of PE's is also laborious if they
do not run identical programs. Therefore, automatic program generators for some
algorithms have also been developed to help design parallel digital filters.
Once the filter response, input sampling rate and chip speed are given, the
program generator first chooses the appropriate structure, then generates the
coefficients, and finally writes the programs that run on SIMON.

1.3.2.  Scope  of  the  Dissertation 

In the next chapter, various parallel algorithms and structures for the
realization of IIR digital filters are presented. These algorithms can be
divided into two categories, namely single-input single-output (SISO) and
multi-input multi-output (MIMO) systems. A highly efficient structure which uses
extensive pipelining and multiprocessing is described in chapter 3. This high
speed structure is realized in the state space domain rather than in the usual
I/O domain. The performance of all the algorithms mentioned in chapters 2 and 3
is analyzed and compared in chapter 4. The performance comparison concerns the
speed limitation, the effect of transmission delay on speed, latency, the
computational rate of FIR versus IIR filters and the efficiency of processor
usage. An important filter design parameter, roundoff noise, is discussed for
the block state filters in chapter 5. In chapter 8, simulation results are shown
which verify the analysis described in chapters 4 and 5.


CHAPTER  2 


PARALLEL IIR FILTER DESIGN

Unlike Finite Impulse Response (FIR) filters, it is not obvious how to design
a high speed IIR filter utilizing parallelism. The difficulty arises because
the output of an IIR filter depends not only on the current and previous input
samples but also on the previous output samples. This feedback would seem to
put an upper bound on the speed of operation. We will concentrate here on the
IIR filter implementation only; however, all the algorithms presented below can
also be applied to FIR filters as a degenerate special case.


2.1.  Introduction 


In this and the following chapters, assume the filter to be implemented is
characterized by difference equation (2-1), unless otherwise specified.

    $y_n = \sum_{i=1}^{N} b_i y_{n-i} + \sum_{i=0}^{M} a_i x_{n-i}$    (2-1)

where the coefficients are real numbers. Taking the z transform of (2-1), the
transfer function is

    $H(z) = \dfrac{Y(z)}{X(z)} = \dfrac{\sum_{i=0}^{M} a_i z^{-i}}{1 - \sum_{i=1}^{N} b_i z^{-i}}$    (2-2)

There are two interesting cases in equation (2-1). One, the finite impulse
response (FIR) filter, corresponds to $N = 0$. The other, the infinite impulse
response (IIR) filter, corresponds to $M \ge 1$, $N > 0$ and $b_N \ne 0$. The
tradeoff between these two cases is roughly that, to approximate a given desired
response, the FIR filter requires a much larger order $M$ and hence a larger
computation rate (as measured by multiplies and adds per output sample).
However, when the sampling rate at which a filter can be implemented for a given
speed of hardware is considered, the FIR filter appears at first examination to
be faster because of the natural way in which parallelism can be exploited.
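As a point of reference for the rest of the chapter, the difference equation (2-1) can be evaluated sample by sample as in the following sketch. The function name, the list-based coefficient layout (`b[0]` holds $b_1$) and the zero initial conditions are assumptions made for illustration.

```python
def iir_direct(a, b, x):
    """Evaluate y[n] = sum_{i=1..N} b_i*y[n-i] + sum_{i=0..M} a_i*x[n-i].

    a = [a_0, ..., a_M], b = [b_1, ..., b_N]; samples before n = 0 are taken as zero.
    Passing b = [] gives the FIR case (N = 0).
    """
    y = []
    for n in range(len(x)):
        acc = sum(a[i] * x[n - i] for i in range(len(a)) if n - i >= 0)
        # The feedback term needs y[n-1], ..., y[n-N], so outputs must be
        # produced in order; this is the dependency that limits parallelism.
        acc += sum(b[i - 1] * y[n - i] for i in range(1, len(b) + 1) if n - i >= 0)
        y.append(acc)
    return y


impulse = [1.0, 0.0, 0.0, 0.0]
print(iir_direct([1.0], [0.5], impulse))       # first-order IIR example
print(iir_direct([1.0, 2.0], [], impulse))     # FIR example, N = 0
```

In the FIR call every output depends on inputs only, whereas in the IIR call each output waits on earlier outputs.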

Besides duplicating a filter section implementing equation (2-1) as shown in
Figure 1-2, another method for achieving speed with an FIR filter is simply to
calculate a vector of $L$ successive output samples in parallel. To see this,
define output and input vectors of $L$ successive samples as

    $Y_n = [\, y_{nL},\ y_{nL+1},\ \ldots,\ y_{(n+1)L-1} \,]^T$    (2-3a)

    $X_n = [\, x_{nL},\ x_{nL+1},\ \ldots,\ x_{(n+1)L-1} \,]^T$    (2-3b)

where $n$ denotes the block number. If we take $L \ge M$ for an FIR filter, then
it is easy to see that the filter can be represented as

    $Y_n = A X_n + B X_{n-1}$    (2-4)

where $A$ and $B$ are appropriate $L \times L$ matrices. We refer to the form of
the filter given by (2-1) as SISO (single-input single-output) and to (2-4) as
MIMO (multiple-input multiple-output). The MIMO system is shown schematically
in Figure 2-1. A detailed discussion of the realization of both SISO and MIMO
FIR filters will be given in the next chapter.
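The matrices in (2-4) are not written out at this point in the text. The sketch below shows one construction that satisfies (2-4) for an FIR filter with taps $a_0, \ldots, a_M$ and $L \ge M$; the Toeplitz layout and all function names are assumptions made for illustration.

```python
def block_fir_matrices(a, L):
    """Return (A, B) such that Y_n = A X_n + B X_{n-1} for taps a[0..M], L >= M."""
    M = len(a) - 1
    assert L >= M, "block length must be at least the filter order"

    def tap(i):
        return a[i] if 0 <= i <= M else 0.0

    # Row r, column c: the tap that links output n*L + r to input n*L + c
    # (matrix A) or to input (n-1)*L + c (matrix B).
    A = [[tap(r - c) for c in range(L)] for r in range(L)]
    B = [[tap(L + r - c) for c in range(L)] for r in range(L)]
    return A, B


def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def block_fir(a, x, L):
    """Filter x (length a multiple of L) one block at a time."""
    A, B = block_fir_matrices(a, L)
    prev, y = [0.0] * L, []
    for start in range(0, len(x), L):
        cur = x[start:start + L]
        y.extend(p + q for p, q in zip(matvec(A, cur), matvec(B, prev)))
        prev = cur
    return y


print(block_fir([1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3))
```

All $L$ rows of a block can be evaluated at the same time, since each row uses only the current and the previous input block.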

As with FIR filters, the algorithms for IIR filters can also be divided into
SISO and MIMO. In this chapter, various parallel algorithms for realizing
equation (2-1) with $N \ge M$ will be presented. The best known cascade and
parallel forms, which will be used as references and compared to the other
structures, will be briefly explained in the first section. The remaining
parallel algorithms will be presented subsequently. Their performance, as
measured by the overall execution time with and without the message
transmission delay, will be analyzed in the following chapters, and simulation
results will also be presented to verify our analyses.


Figure  2-1  Block  Diagram  of  MIMO  Filters 


A filter with $M > N$ can be realized with an FIR filter cascaded with an IIR
filter which satisfies the above criterion. Equation (2-2) then becomes


t*i 

y) 

where  a*  =  2j 

j*max[ 0.  i-U-M) 

Since  the  parallel  implementation  of  the  first  term  in  the  right  hand  side  is 
straightforward,  the  greatest  effort  will  be  devoted  to  the  implementation  of  the 
second  term,  which  is  an  IIR  filter. 


Sections 2-2, 2-3 and 2-4 will be devoted to the algorithms implementing
SISO filters. In Section 2-2, the well-known cascade and parallel forms will be
treated. A special architecture, the systolic array, will be discussed in
Section 2-3. Finally, a high speed Single Instruction Multiple Data mode
algorithm will be dealt with in Section 2-4. MIMO filters will be discussed in
Sections 2-5 and 2-6.


Before discussing the cascade and parallel forms of a digital IIR filter, we
briefly define the realization and the structure of a filter, because these
terms are used repeatedly in the following discussion. A realization of a
digital filter is any hardware configuration or set of programs implementing a
set of arithmetic operations and delay elements which achieves the transfer
function in (2-2). A structure is a specific circuit configuration or a
specific sequence of instructions in a program that realizes a filter. For
example, the direct, cascade and parallel forms are three different structures
of a digital filter. However, all these structures can be used to realize the
same filter equation.

2.2.  Cascade  and  Parallel  Forms 

In conventional filter design, the most popular structures are the cascade
and parallel connections of 2nd and/or 1st order filters. They are popular not
only because they are modular, but also because they are insensitive to roundoff
noise. Their modularity makes them easily expandable to any size and their
insensitivity guarantees a large dynamic range with a minimum number of bits
of accuracy in the architecture.

2.2.1.  Cascade  Form 

For a filter with real coefficients, we can factor the numerator and
denominator of equation (2-2) into a product of 1st and/or 2nd order polynomials
while keeping all the coefficients real. Thus, equation (2-2) can be rewritten as

    $H(z) = \prod_{i} \dfrac{N_i(z)}{D_i(z)}$    (2-5)

Each term in the product is the transfer function of a second order filter for a
complex conjugate pair of poles or a first order filter for a real pole. Whenever
possible, we can also combine two real poles to form a second order filter.
Figure 2-2 shows the structure of this form.

The problems left are the pairing of poles and zeros and the ordering of
those second and first order filters. Although neither pairing nor ordering will
affect the overall transfer function and the overall speed, they will usually
affect the roundoff noise behavior[5].


2.2.2. Parallel Form

To realize a filter in a parallel form, we first have to obtain a partial
fraction expansion of the original transfer function and then realize each
term independently. The outputs of all the sections are sent to an adder and
summed to give the final result. If all the poles are simple, equation (2-2)
can be expressed as

    $H(z) = d + \sum_{i} \dfrac{N_i(z)}{D_i(z)}$    (2-6)

The numerator of each term in the summation is a polynomial with degree one
less than that of the denominator. For a complex conjugate pair, the denominator
is a polynomial of degree 2, while for a real pole it is of degree 1. The
constant $d$ is nonzero if $M = N$ and zero if $M < N$. This constant term can
also be distributed to each filter section. Then, equation (2-6) becomes


    $H(z) = \sum_{i} \left[ d_i + \dfrac{N_i(z)}{D_i(z)} \right]$    (2-7)

where $\sum_i d_i = d$. The structure of this form is shown in Figure 2-3.
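The equivalence of the cascade and parallel forms can be checked numerically on a small case. The sketch below is an illustration only: it assumes a filter with two distinct real poles and numerator 1, written in powers of $z^{-1}$, so that each section is first order, and the partial fraction gains are worked out for that case alone.

```python
def first_order(gain, pole, x):
    """One first-order section: y[n] = pole*y[n-1] + gain*x[n]."""
    y, state = [], 0.0
    for sample in x:
        state = pole * state + gain * sample
        y.append(state)
    return y


def cascade(p1, p2, x):
    """H(z) = 1/((1 - p1 z^-1)(1 - p2 z^-1)) as two sections in series."""
    return first_order(1.0, p2, first_order(1.0, p1, x))


def parallel(p1, p2, x):
    """Same H(z) as a sum of two sections, using its partial fraction expansion:
    c1/(1 - p1 z^-1) + c2/(1 - p2 z^-1), c1 = p1/(p1 - p2), c2 = -p2/(p1 - p2)."""
    c1, c2 = p1 / (p1 - p2), -p2 / (p1 - p2)
    return [u + v for u, v in zip(first_order(c1, p1, x), first_order(c2, p2, x))]


impulse = [1.0, 0.0, 0.0, 0.0]
print(cascade(0.5, 0.25, impulse))
print(parallel(0.5, 0.25, impulse))
```

In the parallel form the two sections share only the input and the final adder, whereas in the cascade form the second section must wait for the first.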


The conclusion that each term in the summation in (2-6) is either a 1st or a
2nd order filter is based on the assumption that all the poles are simple. If
any poles with multiplicity greater than one exist, some denominators must have
a degree higher than 2 for complex pole pairs. Suppose an 8th order filter,
which has two simple complex conjugate pairs of poles and a complex conjugate
pole pair with multiplicity 2, is realized in a parallel form. The transfer
function of this filter can be represented by a summation of three terms as
shown in (2-8).

    $H(z) = \dfrac{N_1(z)}{(z-p_1)(z-p_1^*)} + \dfrac{N_2(z)}{(z-p_2)(z-p_2^*)} + \dfrac{N_3(z)}{(z-p_3)^2 (z-p_3^*)^2}$    (2-8)

where the first and the second terms are second order filters and the third term
is a fourth order filter. The filter structure can be drawn as in Figure 2-4a,
where block 3 is a fourth order filter. It is obvious that the overall speed is
governed by this fourth order filter instead of a second order section. If the
third term in (2-8) is transformed into a product of two second order filters
as follows

    $\dfrac{N_3(z)}{(z-p_3)^2 (z-p_3^*)^2} = \dfrac{N_{31}(z)}{(z-p_3)(z-p_3^*)} \cdot \dfrac{N_{32}(z)}{(z-p_3)(z-p_3^*)}$

where $N_{31}(z) N_{32}(z) = N_3(z)$, the overall speed again depends on a 2nd
order filter, provided delay lines are introduced after blocks 1 and 2. The
overall structure is shown in Figure 2-4b. The dotted squares are delay elements
and the solid squares are 2nd order sections.

An alternative method is to transform (2-8) into a product of a 6th order
filter and a 2nd order filter, which can be represented by equation (2-9).
A structure that realizes (2-9) is shown in Figure 2-5, with each square
representing a 2nd order filter. This structure is very important in achieving
high speed block filters which are insensitive to the message transmission
delay, as will be shown in Chapter 3.


Applying the above technique to all the multiple poles, (2-7) can be rewritten
as a multi-stage connection of subfilters, each of which is a 2nd or 1st order
filter. Here $K$ is the maximum multiplicity among all the multiple poles and
$P_j$ is the number of subfilters in the $j$th stage. Thus, the minimum number
of stages required to achieve this second order filter connection is the
maximum multiplicity among all the multiple order poles.


In Figures 2-2 and 2-3, the structure of each 2nd order subfilter is not
shown, because many different structures exist. Some canonical forms require
less computation and hence give higher throughput; however, roundoff noise and
coefficient sensitivity might be a problem. Other realizations assure low
roundoff noise, but usually require more computation and hence run at lower
speed. On the other hand, lower roundoff noise implies that fewer bits are
required to represent each number, so each computation takes less time, which
may compensate for the speed lost to the increased number of operations.
Therefore, a designer has to consider the tradeoff between roundoff noise and
speed. These issues will receive detailed consideration in Chapter 5.


2.3.  Systolic  Arrays 

A systolic array is a network of interconnected PE's which rhythmically
computes and passes data through its PE's. It is one alternative realization
which must be considered when realizing a digital filter. This array usually
connects only a few types of simple processors, which are sometimes called
cells. The data flow pattern through these cells is usually simple and regular
so that cells can be connected by a network with local and regular
interconnections. This array can sustain a very high throughput rate because it
uses extensive pipelining and multiprocessing. Furthermore, the architecture of
every cell is very simple and each cell is efficiently used. The low design
cost resulting from the replication of identical cells, together with the local
interconnection, makes this architecture extremely suitable for VLSI design.

The systolic array was first proposed for the implementation of matrix
operations in VLSI by Kung and Leiserson[6]. For applications other than matrix
operations, the function of the basic cell(s) and the interconnection among the
cells have to be characterized. For the implementation of an IIR digital filter
as in (2-1), Kung[7] defined two basic cells, which are drawn in Figure 2-6a.
The interconnection of cells for implementing a fourth order filter is shown in
Figure 2-6b. The implemented filter can be represented by the following equation

    $y_n = \sum_{i=1}^{4} b_i y_{n-i} + \sum_{i=0}^{M} a_i x_{n-i}$

Each cell in the array of Figure 2-6b contains two coefficients. One of them
multiplies the x input and the other one multiplies the y input. The input
samples, $\{x_n\}$, enter this array from the left. The output samples,
$\{Y_n\}$, travel through this array from the rightmost cell at the same speed
as the input sequence. Each $Y_i$ is initialized to zero when entering the array
from the rightmost cell. Marching along the array, it accumulates terms from
right to left and eventually obtains its final value when reaching the leftmost
cell. The final value of $Y_i$ is also fed back to the array for use in
computing $Y_{i+1}$ to $Y_{i+4}$. The feedback sequence is labeled $\{y_n\}$ to
distinguish it from the output samples.

Each cell except the first one communicates only with its right and left
neighbors, while the first cell talks only to its right neighbor. No global
communication is required anywhere in this architecture. Since every two
adjacent samples of both the $x_i$ and $y_i$ sequences are separated by two
clock cycles to ensure that each $x_i$ meets every $y_i$, it is clear that only
half of the cells in the array are active at any given time. Thus, we can
combine two adjacent cells into one so as to fully utilize the hardware
resources.

As for FIR filters, their implementation is equivalent to the realization of
finite convolutions. Kung[9] mentioned several different ways of realizing finite
convolutions with systolic approaches. One type of realization uses the same
structure as in Figure 2-6b, except that each cell has only one coefficient and
no feedback links exist. Further, the input samples travel twice as fast as the
output samples. The realization of FIR filters will be discussed in detail in
the next chapter.

2.4. SSIMD Mode

Yet another possibility for digital filter realization is the Skewed Single
Instruction Multiple Data (SSIMD) mode, in which exactly the same arithmetic
operations are executed on a set of identical processing elements. The starting
time of each PE is skewed by a fixed amount so that the PE's work on successive
input samples. This implementation can be applied to any signal flow graph
consisting of multiplications, additions and delays. Barnwell[9] decomposed
equation (2-1) into a set of two equations as

    $r_n = \sum_{i=1}^{N} b_i r_{n-i} + x_n$    (2-11a)

    $y_n = \sum_{i=0}^{M} a_i r_{n-i}$    (2-11b)

This is equivalent to decomposing the filter into an all pole filter followed by
an all zero filter. Both equations are executed on every PE to generate not only
the output sample $y_n$ but also the output sample $r_n$ of the all pole
section. The PE's are skewed in time to work on inputs and outputs of different
time indices. The delayed versions of $r_n$ needed in each PE are not computed
internally by that PE, but are supplied by other PE's executing the same code.
Therefore, each PE has to fetch not only the input samples but also the
intermediate outputs from some other PE's.
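The decomposition in (2-11) can be checked against the original difference equation with a short sketch. The code below is an illustration under stated assumptions: it runs the two equations on a single sequential loop, so the time skew and the exchange of $r_n$ values between PE's are not modeled, and the function names and test coefficients are hypothetical.

```python
def direct_form(a, b, x):
    """Equation (2-1): y[n] = sum b_i*y[n-i] + sum a_i*x[n-i], zero initial state."""
    y = []
    for n in range(len(x)):
        acc = sum(a[i] * x[n - i] for i in range(len(a)) if n - i >= 0)
        acc += sum(b[i - 1] * y[n - i] for i in range(1, len(b) + 1) if n - i >= 0)
        y.append(acc)
    return y


def all_pole_then_all_zero(a, b, x):
    """Equations (2-11a) and (2-11b) evaluated together for every sample index."""
    r, y = [], []
    for n in range(len(x)):
        # (2-11a): all pole section; in SSIMD the r[n-i] come from other PE's.
        r_n = x[n] + sum(b[i - 1] * r[n - i] for i in range(1, len(b) + 1) if n - i >= 0)
        r.append(r_n)
        # (2-11b): all zero section applied to the intermediate sequence r.
        y.append(sum(a[i] * r[n - i] for i in range(len(a)) if n - i >= 0))
    return y


a, b = [1.0, 0.5], [0.5]
impulse = [1.0, 0.0, 0.0, 0.0]
print(direct_form(a, b, impulse))
print(all_pole_then_all_zero(a, b, impulse))
```

The only recursive quantity is $r_n$, which is why it is the value that has to be passed between PE's in the skewed schedule.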


The fundamental concept of this implementation is illustrated by the second
order filter example shown in Figure 2-7. In this example, the second order
direct form filter of Figure 2-7a is implemented on a single PE according to
equations (2-11), where $r_{n-1}$ and $r_{n-2}$ are computed internally in the
previous cycles. Figure 2-7b shows the function of one PE in a multiprocessor
realization, where $r_{n-1}$ and $r_{n-2}$ are extra input samples and $r_n$ is
an extra output sample. Figure 2-8 shows an example of implementing this second
order filter on two and four PE's.

In Figure 2-8a, a two processor realization is illustrated. The main reason
that multiprocessing is possible is that, for PE 1, even though the value of
$r_{n-1}$ must be available before we can obtain $r_n$, it need not be available
before the computation of $r_n$ is started. What is required, rather, is that
the value of $r_{n-1}$ be available before it is used by PE 1. Hence, PE 1 may
start computing before $y_{n-1}$ and even $r_{n-1}$ are available. The
availability of $r_{n-2}$ is not a problem, since it is computed by the same PE
in the previous cycle; hence, it is always available when we start computing
$r_n$. On the other hand, for a four processor implementation as in Figure 2-8b,
PE 1 may start computing as soon as it is guaranteed that both $r_{n-2}$ and
$r_{n-1}$ are available when they are needed. If the availability of these two
values gives different constraints on the starting time, we have to choose the
later starting time. This argument extends readily to filters of any order. For
an $N$th order filter, the availability of $r_{n-1}$ through $r_{n-N}$ in each
PE has to be considered, and then the latest starting time is chosen.

This implementation has two good features. First, since it is a single
instruction multiple data mode, only one program has to be generated for all
the PE's. This is a good property especially when a large number of PE's is
involved. The other advantage of this mode is that the data precedence
relations among PE's are automatically maintained by the intrinsic synchrony of
the system. However, the irregular and non-local data flow pattern can cause a
serious problem in speed performance, as will be discussed in Chapter 4.

All the algorithms mentioned above concern SISO filter design only, and no
MIMO filters have been discussed yet. SISO filters process the input samples on
a sample-by-sample basis. The rest of this chapter is devoted to MIMO filter
design, in which the input samples are processed in blocks, hence the name block
processing.

2.5.  Block  Processing 

As in FIR filter design, IIR filters can also be realized in a block form as
shown in Figure 2-1. This is done by accumulating the input samples in a buffer
of size $L$ and then processing these samples simultaneously. Historically, the
original motivation behind the use of block processing was the possibility of
employing FFT techniques for intermediate computation. Although communication
is a serious problem for the FFT in VLSI, structures of block implementation
exist which are very efficient for multiprocessing. We will discuss an
input-output formulation in this section, and a state equation formulation will
be given in the next section.

Stockham[10] has shown that filters with zeros alone can be synthesized by
using the FFT algorithm. Hence, FIR filters can be realized efficiently by this
fast algorithm. Although it was once assumed that only recursive techniques can
be employed for IIR filters, Gold and Jordan[11] proved that this was not true.
They have shown that IIR filters can be synthesized by a combination of three
finite convolutions, assuming some initial conditions. Voelcker[12] considered
the same problem using the z transform and showed that recursive filters could
be realized by a combination of finite convolutions and block feedback. However,
his algorithm introduced additional poles, which might cause a stability
problem.


Later, Burrus[13] developed a block feedback structure in the time domain
based on the matrix representation of convolutions. This algorithm does not
cause any stability problems since the extra poles are always located at the
origin. Mitra[14] later showed that Gold and Jordan's formulation is actually a
particular case of this algorithm. Since Burrus' algorithm is very efficient for
parallel processing, a detailed discussion will be given here.


2.5.1.  Basic  Block  Feedback 

To  ease  notation,  let  us  consider  a  third  order  filter  with  transfer  function 
as  in  equation  (2-12). 


.  -  gs+a1e~1+q22~2+a3Z  ~3 

'2  ” 


(2-12) 


In the time domain, the filter can be represented in a matrix form as in
equation (2-13). Notice that both matrices grow indefinitely until the input
samples are exhausted.
Equation (2-13) can be rewritten as in (2-15).


Equating both sides on the row containing $Y_{n+1}$, the following recursive
equation can be obtained.

    $B_1 Y_n + B_0 Y_{n+1} = A_0 X_{n+1} + A_1 X_n$

Since $B_0$ is nonsingular, multiply both sides by $B_0^{-1}$ and obtain

    $Y_{n+1} = -B_0^{-1} B_1 Y_n + B_0^{-1} A_0 X_{n+1} + B_0^{-1} A_1 X_n$    (2-16)

The conversion from equation (2-13) to (2-15) is valid if the size of the four
matrices in (2-14) is not smaller than the filter order. Thus, equation (2-16)
holds for any block size not smaller than 3.
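The block recursion (2-16) can be tried out numerically with the sketch below. The matrices are built here under an assumption, since their definitions are not reproduced in this part of the text: the denominator is taken as $1 - \sum b_i z^{-i}$ from (2-2), so $B_0$ and $A_0$ are lower-triangular Toeplitz matrices and $B_1$ and $A_1$ hold the coefficients that reach into the previous block. All function names are hypothetical.

```python
def shifted_toeplitz(coeffs, L, shift):
    """L x L matrix whose (r, k) entry is coeffs[shift + r - k], or 0 if out of range."""
    def coef(i):
        return coeffs[i] if 0 <= i < len(coeffs) else 0.0
    return [[coef(shift + r - k) for k in range(L)] for r in range(L)]


def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def matvec(P, v):
    return [sum(p * vi for p, vi in zip(row, v)) for row in P]


def unit_lower_inverse(T):
    """Invert a lower-triangular matrix with ones on its diagonal (forward substitution)."""
    L = len(T)
    inv = [[0.0] * L for _ in range(L)]
    for j in range(L):
        for i in range(L):
            delta = 1.0 if i == j else 0.0
            inv[i][j] = delta - sum(T[i][k] * inv[k][j] for k in range(i))
    return inv


def block_iir(a, b, x, L):
    """Filter x (length a multiple of L) block by block in the form of (2-16)."""
    assert L >= max(len(b), len(a) - 1), "block size must not be smaller than the order"
    d = [1.0] + [-bi for bi in b]            # assumed denominator convention, from (2-2)
    B0, B1 = shifted_toeplitz(d, L, 0), shifted_toeplitz(d, L, L)
    A0, A1 = shifted_toeplitz(a, L, 0), shifted_toeplitz(a, L, L)
    B0_inv = unit_lower_inverse(B0)
    F = [[-v for v in row] for row in matmul(B0_inv, B1)]    # -B0^{-1} B1
    G0, G1 = matmul(B0_inv, A0), matmul(B0_inv, A1)          # B0^{-1} A0, B0^{-1} A1
    y_prev, x_prev, y = [0.0] * L, [0.0] * L, []
    for start in range(0, len(x), L):
        x_cur = x[start:start + L]
        y_cur = [f + g0 + g1 for f, g0, g1 in
                 zip(matvec(F, y_prev), matvec(G0, x_cur), matvec(G1, x_prev))]
        y.extend(y_cur)
        y_prev, x_prev = y_cur, x_cur
    return y


def scalar_iir(a, b, x):
    """Reference: the scalar difference equation (2-1)."""
    y = []
    for n in range(len(x)):
        acc = sum(a[i] * x[n - i] for i in range(len(a)) if n - i >= 0)
        acc += sum(b[i - 1] * y[n - i] for i in range(1, len(b) + 1) if n - i >= 0)
        y.append(acc)
    return y


a, b = [1.0, 0.5, 0.25], [0.5, -0.25]
x = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0]
print(block_iir(a, b, x, 3))
print(scalar_iir(a, b, x))
```

The three matrices are computed once; after that each block costs three matrix-vector products whose rows are independent of each other, and only the previous output block is fed back.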

Equation (2-16) can be applied to filters of any order if we define the input
and output vectors as in (2-3) and the four coefficient matrices as in (2-17).
In (2-17), we also assume that the block size is not smaller than the filter
order. The block diagram of this structure is shown in Figure 2-9. Equation
(2-16) can represent various types of filters. For an FIR filter, $B_0$ is an
identity matrix and $B_1$ is a null matrix; hence, the equation becomes

    $Y_{n+1} = A_0 X_{n+1} + A_1 X_n$

which is equivalent to (2-4). Obviously, this equation is true only if the block
size is not smaller than the number of taps of the filter.
16), Mitra and Gnanasekaran[15] derived three other state-structures from the
same equation. Generally speaking, all these structures are similar as far as
throughput rate is concerned. Hence we will discuss only the structure in Figure
2-9.


2.5.3.  Short  Block  Lengths 

The basic relations become a little more complicated when the block size is
less than the order of the filter. For example, if the block size is 2 for the
example in the previous section, two more matrices are required. Equation (2-16)
can be written as

    $Y_{n+1} = -B_0^{-1} B_1 Y_n - B_0^{-1} B_2 Y_{n-1} + B_0^{-1} A_0 X_{n+1} + B_0^{-1} A_1 X_n + B_0^{-1} A_2 X_{n-1}$    (2-18)

where $B_2$ and $A_2$ are the two additional matrices.
This can be extended to any block length, if more feedback terms are allowed.
In fact, if the block length is reduced to one, equation (2-18) becomes the
original scalar difference equation. Hence, the traditional scalar difference
equation is a degenerate case of the block difference equation. Furthermore, if
factoring and partial fraction expansions are applied to the original scalar
equation, we can also obtain a parallel or cascade connection of second-order
block filters. The block length in each section can also be different from that
of the other sections so as to achieve multi-rate digital filters.


2.5.4.  Pole  Locations 


Taking  the  z-transform  of  equation  (2-16),  we  get 
zT(z)  =  -BoiB1r(z)  +  zBo'AoX(z)  +  B?AjC(z) 


(2-19) 


where  Y(z)  and  X{z)  are  the  z  transforms  of  the  vectors  Yn  and  Xn  respectively. 
The  unit  delay  z~l  now  represents  the  delay  time  of  a  block  of  samples,  which  is 
L  times  the  delay  in  the  scalar  case.  If  define  the  transfer  function  of  the  multi¬ 
input  multi-output  system  as 

T(z)  =  H(z)X(z) 

it  can  be  obtained  directly  from  (2-19). 

U(z)  =  [ffy(z)]  =  (zI+Bt'Bj-'izSe'Ac+B^Aj  (2-20) 

where H_{ij}(z) is the transfer function between the ith output sample and the jth
input sample in a block. From the above equation, it is clear that the poles of
the block filter are the roots of det(zI + B_0^{-1} B_1) = 0, that is, the
eigenvalues of the matrix -B_0^{-1} B_1. Mitra and Gnanasekaran [15] verified for
the special case L = N that the eigenvalues of this matrix are the original poles
raised to the power L. This conclusion can also be easily verified by the state
space formulation in the following section. However, from equation (2-17), it is
also clear that the size of the product matrix grows with the block length L. We
should have L eigenvalues from this matrix instead of N, since the size of the
feedback matrix B_0^{-1} B_1 is LxL. The extra L-N+1 poles in this implementation
are actually all located at the origin. This is easily seen by taking a closer
look at the matrix B_1. The leftmost L-N+1 columns of B_1 are all zero vectors.
Thus, the leftmost L-N+1 columns of the product matrix B_0^{-1} B_1 are also zero
vectors, which in turn result in L-N+1 zero eigenvalues.
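
The relation between the block poles and the scalar poles can be checked
numerically on a small case. The sketch below is only an illustration under
stated assumptions: it assumes a second-order recursion
y[n] + b1*y[n-1] + b2*y[n-2] = x[n], a block length L = 2, and the triangular
layout of B_0 and B_1 written in the comments (the definitions in (2-17) are not
reproduced here, so this layout is an assumption). The pole values are
hypothetical. The trace and determinant of -B_0^{-1} B_1 are compared with the
sum and product of the squared scalar poles.

```python
# Hypothetical scalar poles; the coefficient layout below is an assumption.
p1, p2 = 0.5, -0.25
b1, b2 = -(p1 + p2), p1 * p2        # y[n] + b1*y[n-1] + b2*y[n-2] = x[n]

# Assumed block form for L = 2:  B0 * Y[n] + B1 * Y[n-1] = X[n]
B0 = [[1.0, 0.0], [b1, 1.0]]
B1 = [[b2, b1], [0.0, b2]]


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(x, y):
    """Product of two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


product = matmul2(inv2(B0), B1)
F = [[-v for v in row] for row in product]   # feedback matrix -B0^{-1} B1

# For a 2x2 matrix the eigenvalues are fixed by the trace and determinant.
trace_F = F[0][0] + F[1][1]
det_F = F[0][0] * F[1][1] - F[0][1] * F[1][0]

print(trace_F, p1 ** 2 + p2 ** 2)    # sum of eigenvalues vs. sum of squared poles
print(det_F, (p1 * p2) ** 2)         # product of eigenvalues vs. product of squared poles
```

Because a 2x2 matrix is characterized up to similarity of its spectrum by trace
and determinant, agreement of both pairs indicates that the block poles are the
scalar poles raised to the power L = 2 in this example.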

2.6.  Block  State  Space  Realization 

It is well known [18] that any linear shift-invariant system can be
represented by state equations. Although it requires more computation per output
sample than the canonical forms and the forms mentioned above, the state space
design can result in very high speed. Furthermore, this high speed can be
achieved with very simple interconnection among all the processing elements. A
detailed discussion of the speed performance will be given in the next chapter.
In this section, a detailed derivation of the block state equation from the
scalar transfer function will be given.

These  equations  can  tell  us  not  only  the  input-output  relationship  but  also 
the  internal  data  flow.  This  is  especially  beneficial  for  roundoff  noise  analysis, 
since  roundoff  noise  depends  heavily  on  the  structure  of  the  filter.  In  a  later 
chapter,  we  will  derive  the  roundoff  noise  power  at  a  filter  output  in  terms  of  the 
state  equation  coefficients. 

2.6.1.  Introduction 

The minimal state representation of an Nth order block filter as in Fig. 2-1
can be written as

    R_{n+1} = A R_n + B X_n        (2-21a)

    Y_n = C R_n + D X_n            (2-21b)

where A, B, C and D are, respectively, NxN, NxL, LxN and LxL constant
matrices. R_n, X_n and Y_n are, respectively, the state vector, the input vector
and the output vector at time n, where X_n and Y_n are defined as in equations
(2-3). This representation can be uniquely characterized by the constant matrices
(A, B, C, D). The block diagram of this block filter is shown in Figure 2-10.

Actually, the state matrix A can be of any size greater than N if some dummy
states (poles) are introduced. The sizes of matrices B and C then change
accordingly. This size change is acceptable because the only thing that matters
for digital filters is the zero-state response, or equivalently, the transfer
function. The state equation with larger matrices gives the same transfer
function after pole-zero cancellation, and hence the realization is not
minimal. For instance, the direct form implementation of a 2nd order filter as
shown in Figure 2-7a with block size 1 can be expressed by state equations with
the following matrices

    C = [a_0, a_1, a_2],    D = 0        (2-22)


Figure 2-10 Block Diagram of Block State Filters

We  will  concentrate  here  on  the  minimal  realization  only. 

The  state  matrix  A  plays  a  very  important  role  in  achieving  high  speed 
filters,  since  it  is  the  only  feedback  matrix  in  the  state  equations.  This  state 
matrix  can  vary  from  a  full  matrix  to  a  very  simple  one  such  as  the  Jordan  form. 
The diagonality of the Jordan form is extremely important in simplifying the
filter structure, since the operation of each entry on the diagonal can be
performed independently. Thus, the computation path on the feedback term can
be divided into several parallel paths with much less computation on each path.
Since the execution time of the feedback term determines the filter speed, we can
easily trade hardware for speed with a diagonal state matrix.

A diagonal state matrix usually introduces complex numbers into its entries,
since the diagonal elements are equal to the filter poles in this case. These
complex numbers unnecessarily increase the computation because all the filter
coefficients as well as the input data are real numbers. All the states can also
be real numbers if the states are chosen appropriately. By combining each complex
conjugate pair of poles, a complex 2x2 diagonal submatrix can be converted into a
real 2x2 full matrix. This will transform the state matrix from a diagonal Jordan
form into a block diagonal form. For each state, a complex multiplication is
transformed into two real multiplications. Hence, the computational rate is
reduced to one half.

Although these equations exist in theory, they are not easy to obtain
directly from a filter difference equation, which is the form we usually have for
implementation. Fortunately, by computing only one out of every L state vectors,
the block state equations can be easily obtained from the scalar (SISO) state
equations, which are, in turn, fairly easy to obtain from the scalar difference
equation. If synthesized from second and/or first order filters, the state
matrix will be block diagonal. Thus, a step-by-step procedure to obtain the block
diagonal state equations from the SISO filter design will be shown in the next
subsection.

However, the four matrices for a given filter transfer function are not
unique. Actually, there is an infinite number of sets of matrices which give rise
to the same transfer function, obtained by choosing different states. Even if
restricted to the minimal realization, as we will see later, there is still a lot
of freedom to choose the coefficients. Filters which have a block diagonal state
matrix and low roundoff noise will be derived in Chapter 5. In this section,
synthesizing a block state filter from scalar second order filters will be shown.
The realization of a low noise second order filter will also be given in
Chapter 5.

2.6.2. Single Input Single Output Filter

2.6.2.1. High Order Filters

The minimal SISO state space representation of an Nth order filter can be
written as

    r_{n+1} = A r_n + b x_n        (2-23a)

    y_n = c r_n + d x_n            (2-23b)

where x_n and y_n are scalars rather than vectors. Taking the z-transform of
equations (2-23), we get

    z R(z) = A R(z) + b X(z)       (2-24)

    Y(z) = c R(z) + d X(z)         (2-25)

Substituting the R(z) of (2-24) into (2-25), we get

    H(z) = Y(z) / X(z) = c (zI - A)^{-1} b + d        (2-26)

For a given filter transfer function, if all the poles are distinct, one can
perform a partial fraction expansion of the original filter such that the
transfer function is a summation of several second and/or first order filters.
For the case of all distinct complex poles, this transfer function can be written
as

    H(z) = d + \sum_{i=1}^{P} H_i(z)        (2-27)

where P is the number of poles in the upper half z-plane and H_i(z) is the second
order term contributed by the ith complex conjugate pole pair. Each term in the
summation can be realized in the state space. Suppose the state space realization
of the ith term is characterized by (A_i, b_i, c_i, 0), where A_i is a 2x2 matrix.
These three matrices can be related to equation (2-27) by

    c_i (z I_2 - A_i)^{-1} b_i = H_i(z)        (2-28)

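where I_2 is the 2x2 identity matrix. Then, it is straightforward to obtain the
high order filter (A, b, c, d) by defining the filter coefficients as

    A = \begin{bmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & A_P \end{bmatrix},
    b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_P \end{bmatrix},
    c = [c_1, c_2, \ldots, c_P]        (2-29)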

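This can be easily verified by plugging these coefficients into equation (2-26)
and then comparing with equations (2-27) and (2-28).

    H(z) = c (zI - A)^{-1} b + d

         = [c_1, c_2, \ldots, c_P]
           \begin{bmatrix} (zI_2 - A_1)^{-1} & 0 & \cdots & 0 \\ 0 & (zI_2 - A_2)^{-1} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & (zI_2 - A_P)^{-1} \end{bmatrix}
           \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_P \end{bmatrix} + d

         = [c_1 (zI_2 - A_1)^{-1}, c_2 (zI_2 - A_2)^{-1}, \ldots, c_P (zI_2 - A_P)^{-1}]
           \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_P \end{bmatrix} + d

         = d + \sum_{i=1}^{P} c_i (zI_2 - A_i)^{-1} b_i

This is exactly the same as equation (2-27) once each product of A_i, b_i and c_i
is replaced using (2-28).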
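
The parallel synthesis in (2-29) can be illustrated with a short simulation. The
following is a minimal sketch, assuming two hypothetical real second order
sections (A_i, b_i, c_i) with made-up coefficient values; it assembles the block
diagonal A and the stacked b and c, and compares the impulse response of the
combined filter with d plus the sum of the section impulse responses.

```python
# Hypothetical second order sections (A_i, b_i, c_i); values are illustrative only.
sections = [
    ([[0.5, 0.2], [-0.2, 0.5]], [1.0, 0.0], [0.3, 0.4]),
    ([[-0.3, 0.6], [-0.6, -0.3]], [0.5, 1.0], [1.0, -0.2]),
]
d = 0.25
N_SAMPLES = 16


def impulse_response(A, b, c, d, n_samples):
    """Simulate r[n+1] = A r[n] + b x[n], y[n] = c r[n] + d x[n] for a unit impulse."""
    n = len(b)
    r = [0.0] * n
    out = []
    for t in range(n_samples):
        x = 1.0 if t == 0 else 0.0
        out.append(sum(c[i] * r[i] for i in range(n)) + d * x)
        r = [sum(A[i][j] * r[j] for j in range(n)) + b[i] * x for i in range(n)]
    return out


# Assemble the block diagonal A and the stacked b and c as in (2-29).
size = 2 * len(sections)
A = [[0.0] * size for _ in range(size)]
b, c = [], []
for k, (Ai, bi, ci) in enumerate(sections):
    for i in range(2):
        for j in range(2):
            A[2 * k + i][2 * k + j] = Ai[i][j]
    b += bi
    c += ci

h_full = impulse_response(A, b, c, d, N_SAMPLES)

# Direct term d plus the sum of the individual sections (each realized with d = 0).
h_sum = [d if t == 0 else 0.0 for t in range(N_SAMPLES)]
for Ai, bi, ci in sections:
    h_i = impulse_response(Ai, bi, ci, 0.0, N_SAMPLES)
    h_sum = [u + v for u, v in zip(h_sum, h_i)]

print(max(abs(u - v) for u, v in zip(h_full, h_sum)))
```

Since the sections are decoupled, each 2x2 state update in the combined filter
could be assigned to its own processing element.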

It  is  clear  that  the  matrix  A  is  block  diagonal  and  each  submatrix  is  of  size 
2x2.  Therefore,  from  equation  (2-29),  a  high  order  filter  can  be  synthesized, 
once  a  realization  of  a  second  order  filter  in  state  space  is  developed.  A  minimal 
second  order  filter  design  will  be  discussed  in  the  next  subsection. 

For a filter with poles of higher multiplicity, equation (2-27) is no longer
valid. Some terms in the summation must have a denominator of degree higher
than 2. This in turn will affect the structure of matrix A. A can stay in a
block diagonal form if the original transfer function is factored into several
product terms as in equation (2-10). Then, each term can be independently
realized with a block diagonal state matrix. The filter now becomes a cascade
connection of block filters with simple poles only.

2.6.2.2. Second Order Filter Design

Suppose the ith term in the summation of equation (2-27) is of the following
form,

    H_i(z) = \frac{r}{z - p} + \frac{r^*}{z - p^*}        (2-30)

A filter in state space design can be readily obtained by defining the
coefficients as

    A = \begin{bmatrix} p & 0 \\ 0 & p^* \end{bmatrix},
    b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix},
    c = [c_1, c_2]        (2-31)

where b_1 c_1 = r and b_2 c_2 = r^*. This can be easily verified by substituting
the coefficients in (2-31) into (2-26) and comparing with (2-30). Actually,
depending on how we choose the states, these equations can result in an infinite
number of realizations with different coefficients. One useful example would be
making all coefficients real numbers.

Real  matrices  would  be  desirable,  since  the  computational  rate  is  lower. 
For  a  complex  state  matrix,  two  complex  multiplications  or  equivalently,  8  real 
multiplications  are  needed,  while  in  the  real  state  matrix  case,  only  4  multiplica¬ 
tions  are  required.  Theoretically,  a  real  state  equation  is  obtainable,  since  the 
filter  itself  is  real. 

Before showing how this can be achieved, let us state a useful theorem as
follows.

Theorem 2-1 - Equivalent Realization

Given a realization (A, b, c, d) for an Nth order filter, define \hat{r} = T r,
where T is an NxN nonsingular matrix. Then (T A T^{-1}, T b, c T^{-1}, d) realizes
the same filter and the state equations become

    \hat{r}_{n+1} = T A T^{-1} \hat{r}_n + T b x_n
    y_n = c T^{-1} \hat{r}_n + d x_n

If we set the nonsingular matrix T to be

    T = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ j & -j \end{bmatrix}

the transformed system of (2-31) is real if b_1, b_2 and c_1, c_2 are complex
conjugate pairs. The new set of coefficients will be

    \hat{A} = \begin{bmatrix} Re(p) & Im(p) \\ -Im(p) & Re(p) \end{bmatrix},
    \hat{b} = \begin{bmatrix} Re(b_1) \\ -Im(b_1) \end{bmatrix},

    \hat{c} = 2 [Re(c_1), Im(c_1)],    \hat{d} = d
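
The equivalence of the complex diagonal realization and its real counterpart can
be checked by simulation. The sketch below assumes the transformation matrix
T = (1/2) [[1, 1], [j, -j]] and uses hypothetical values for p, b_1, c_1 and d;
with b_2 and c_2 taken as the conjugates of b_1 and c_1, both realizations
should produce the same real impulse response.

```python
# Hypothetical pole and coefficients; b2 and c2 are the conjugates of b1 and c1.
p = complex(0.6, 0.5)
b1 = complex(1.0, -2.0)
c1 = complex(0.5, 0.25)
d = 0.1

# Complex diagonal realization, as in (2-31).
A_c = [[p, 0], [0, p.conjugate()]]
b_c = [b1, b1.conjugate()]
c_c = [c1, c1.conjugate()]

# Real realization, assuming T = (1/2) [[1, 1], [j, -j]].
A_r = [[p.real, p.imag], [-p.imag, p.real]]
b_r = [b1.real, -b1.imag]
c_r = [2 * c1.real, 2 * c1.imag]


def impulse_response(A, b, c, d, n_samples):
    """Unit-impulse response of r[n+1] = A r[n] + b x[n], y[n] = c r[n] + d x[n]."""
    r = [0, 0]
    out = []
    for t in range(n_samples):
        x = 1.0 if t == 0 else 0.0
        out.append(c[0] * r[0] + c[1] * r[1] + d * x)
        r = [A[0][0] * r[0] + A[0][1] * r[1] + b[0] * x,
             A[1][0] * r[0] + A[1][1] * r[1] + b[1] * x]
    return out


h_complex = impulse_response(A_c, b_c, c_c, d, 12)   # complex arithmetic
h_real = impulse_response(A_r, b_r, c_r, d, 12)      # real arithmetic only

print(max(abs(u - v) for u, v in zip(h_complex, h_real)))
```

The real version needs four real multiplications for the state update, against
two complex multiplications in the diagonal form.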

Obviously, the number of nonsingular matrices T is infinite; hence, there
are infinitely many ways to choose matrices A, b and c. A low roundoff noise
filter can be achieved without increasing computation by choosing an optimal
matrix T. Jackson et al. [17] derived an optimal second order filter in state
space design, and Barnes [13, 19, 20, 21] showed another realization called the
normal filter, which has low noise and is free of autonomous overflow limit
cycles. These two realizations will be derived in Chapter 5.

2.6.3.  Multi-Input  Multi-Output  Filter 

In single-input single-output filter design, the partial fraction expansion of
the original transfer function is usually obtained before realizing the filter in
state  space.  However,  for  a  block  filter  the  transfer  function  is  in  a  matrix  form 
and  usually  not  easy  to  obtain.  Barnes  and  Shinnaka[22]  derived  part  of  the 
transfer  function,  from  which  the  difficulty  in  obtaining  the  general  solution  for 
the  transfer  function  is  evident.  Fortunately,  it  is  not  necessary  to  know  the 
transfer  function  to  implement  the  filter.  The  filter  can  be  implemented  from 
the  state  equations  directly,  which  are  fairly  easy  to  obtain. 


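2.6.3.1. Formulation

Suppose we are given a SISO representation (\mathcal{A}, b, c, d) of the form
(2-23), where \mathcal{A} denotes the state matrix of the scalar system. Relating
the state vectors of MIMO and SISO as R_n = r_{nL}, a representation (A, B, C, D)
in the MIMO system can be easily obtained, where the matrices are defined as
follows

    A = \mathcal{A}^L

    B = [ \mathcal{A}^{L-1} b   \mathcal{A}^{L-2} b   \cdots   b ]

    C = \begin{bmatrix} c \\ c \mathcal{A} \\ \vdots \\ c \mathcal{A}^{L-1} \end{bmatrix}

    D = [d_{ij}],   d_{ij} = \begin{cases} 0, & i < j \\ d, & i = j \\ c \mathcal{A}^{i-j-1} b, & i > j \end{cases}

The state equations can be derived easily as follows

    R_{n+1} = r_{(n+1)L}

            = \mathcal{A} [ \mathcal{A} r_{(n+1)L-2} + b x_{(n+1)L-2} ] + b x_{(n+1)L-1}

            = \mathcal{A}^2 r_{(n+1)L-2} + \mathcal{A} b x_{(n+1)L-2} + b x_{(n+1)L-1}

            \vdots

            = \mathcal{A}^L r_{nL} + \mathcal{A}^{L-1} b x_{nL} + \cdots + \mathcal{A} b x_{(n+1)L-2} + b x_{(n+1)L-1}

            = \mathcal{A}^L R_n + [ \mathcal{A}^{L-1} b   \mathcal{A}^{L-2} b   \cdots   b ]
              \begin{bmatrix} x_{nL} \\ x_{nL+1} \\ \vdots \\ x_{(n+1)L-1} \end{bmatrix}        (2-32)

            = A R_n + B X_n        (2-33)

For the output equation, each output sample in one block can be calculated as

    y_{nL}   = c r_{nL} + d x_{nL} = c R_n + d x_{nL}

    y_{nL+1} = c r_{nL+1} + d x_{nL+1} = c \mathcal{A} r_{nL} + c b x_{nL} + d x_{nL+1}
             = c \mathcal{A} R_n + c b x_{nL} + d x_{nL+1}

    \vdots

    y_{(n+1)L-1} = c \mathcal{A}^{L-1} R_n + c \mathcal{A}^{L-2} b x_{nL} + \cdots + c b x_{(n+1)L-2} + d x_{(n+1)L-1}

Expressing the above equations in a matrix form, we get

    Y_n = C R_n + D X_n        (2-34)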
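
The construction of the block state matrices can be sketched in a few lines of
code. This is a minimal illustration, assuming a hypothetical second order
scalar realization (the numeric values and the helper names are not from the
report). It builds the block matrices from powers of the scalar state matrix
and checks that block processing with L = 4 reproduces the sample-by-sample
output.

```python
def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def mat_vec(X, v):
    return [sum(X[i][k] * v[k] for k in range(len(v))) for i in range(len(X))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def block_state_matrices(A, b, c, d, L):
    """Return (A_blk, B, C, D) for block length L from a scalar realization."""
    n = len(b)
    powers = [[[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]]  # A^0
    for _ in range(L):
        powers.append(mat_mul(powers[-1], A))
    A_blk = powers[L]                                        # A^L
    cols = [mat_vec(powers[L - 1 - j], b) for j in range(L)]  # A^(L-1) b, ..., b
    B = [[cols[j][i] for j in range(L)] for i in range(n)]
    C = [[dot(c, [powers[i][k][j] for k in range(n)]) for j in range(n)]
         for i in range(L)]                                  # rows c A^i
    D = [[0.0 if i < j else d if i == j else dot(c, mat_vec(powers[i - j - 1], b))
          for j in range(L)] for i in range(L)]
    return A_blk, B, C, D


# Hypothetical scalar realization and input sequence.
A = [[0.5, 0.3], [-0.3, 0.5]]
b = [1.0, 0.5]
c = [0.2, -0.4]
d = 0.7
L = 4
x = [((7 * t) % 5) - 2.0 for t in range(12)]

# Reference: scalar state equations, one sample at a time.
r = [0.0, 0.0]
y_scalar = []
for xt in x:
    y_scalar.append(dot(c, r) + d * xt)
    r = [u + bi * xt for u, bi in zip(mat_vec(A, r), b)]

# Block state equations: one state update per block of L samples.
A_blk, B, C, D = block_state_matrices(A, b, c, d, L)
R = [0.0, 0.0]
y_block = []
for n0 in range(0, len(x), L):
    X = x[n0:n0 + L]
    y_block += [u + v for u, v in zip(mat_vec(C, R), mat_vec(D, X))]
    R = [u + v for u, v in zip(mat_vec(A_blk, R), mat_vec(B, X))]

print(max(abs(u - v) for u, v in zip(y_scalar, y_block)))
```

Only the product with A_blk lies in the feedback loop; the products with B, C and
D are feedforward and could be pipelined.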

2.6.3.2. Some Important Properties

Barnes [22] proved some theorems about block state representation, some
of which are important to us and are stated in this section. In the following
discussion, a state-space realization of the scalar system is denoted by
(\mathcal{A}, b, c, d), and its corresponding MIMO system, related by equation
(2-32), is denoted by (A, B, C, D). We also denote the mapping between these two
systems by (\mathcal{A}, b, c, d) -> (A, B, C, D).

Theorem 2-2. Irreducibility Invariance:

(A, B, C, D) is an irreducible realization iff (\mathcal{A}, b, c, d) is an
irreducible realization.

Theorem 2-3. Equivalence Invariance:

(T \mathcal{A} T^{-1}, T b, c T^{-1}, d) -> (T A T^{-1}, T B, C T^{-1}, D) iff
(\mathcal{A}, b, c, d) -> (A, B, C, D), where T is any NxN invertible matrix.

Theorem 2-4. Completeness:

If (A, B, C, D) is an irreducible realization of a block-shift invariant MIMO
system, then there exists a unique scalar system realization (\mathcal{A}, b, c, d)
such that (\mathcal{A}, b, c, d) -> (A, B, C, D).

2.6.4.  Block  Cascade  and  Parallel  Forms 

In scalar input scalar output filter design, the most popular design
approach is to decompose the original filter equation into a cascade or a parallel
connection of second order filters. This idea can also be applied to the block
filter design. All the second order sections can be processed in a parallel or a
pipelined fashion. The block state filter design then reduces to a block second
order filter design. For a second order filter, the state matrix A is always of
size 2x2, which greatly simplifies the filter design process.

Zeman and Lindgren [23] proposed an inner product arithmetic unit which
can compute the matrix-vector multiplications in the block state equations.
They also combine the state and output vectors into one larger vector, and the
equation of a second order filter becomes

    \begin{bmatrix} R_{n+1} \\ Y_n \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} R_n \\ X_n \end{bmatrix}

where A, B, C and D are, respectively, 2x2, 2xL, Lx2 and LxL constant matrices.
Then, the next state and current output are obtained by an inner product
hardware circuit. For each second order filter, there are 2+L inner products,
each one of size 2+L.

2.7.  Conclusions 

Several parallel algorithms for an IIR filter realization have been shown in
this chapter. For the SISO filters, the cascade and parallel forms are well-known
structures and easy to realize. The systolic array approach is a very efficient
structure due to its two-way pipelining of the input and output sequences. It can
also be easily expanded to any order by appending or deleting cells at the end.
Barnwell's filter is of SIMD mode and hence very easy to realize. Only one
program is needed to realize the whole filter. The advantage of the latter two
structures is that they can be directly realized from the difference equation and
no factoring or partial fraction expansion is required. On the other hand, the
roundoff noise behavior is equivalent to that of the direct form implementation,
which would be a serious problem.

For  block  filters,  the  I/O  and  state  space  formulations  are  shown.  Block 
filters  are  extremely  suitable  for  VLSI  processing  due  to  their  increased  parallel¬ 
ism  over  SISO  filters.  A  very  efficient  structure  in  block  state  space  design  will 
be  shown  in  the  next  chapter. 

An extremely efficient structure for the block filter design will be shown in
this chapter. This structure can achieve any desired speed at the expense of
increased overall filter latency and hardware. The latency is defined as the time
for a sample to pass through the filter from the input to the output.
Furthermore, this high speed can be achieved with a structure which requires only
localized communication. This structure is realized with the technique of block
processing described in Chapter 2. As we will see later, block state filters have
a more efficient structure than the I/O filters; hence, a detailed discussion of
the state filters will be given in this chapter.

For the implementation of block filters, the filter equation alone does not
specify the operation assignment and the interconnection among PEs, whereas
for the systolic or SSIMD filter design cases, the filter equation completely
specifies the actual function of each PE. For the systolic approach, even the
structure is completely defined by the filter equation, whereas for Barnwell's
filter, once the processor number is given along with the equation, the actual
interconnection is immediately available. However, given a filter equation in a
block form, there is still a lot of freedom in how to implement the filter.
The ultimate goal is to find a structure which can sustain a high throughput rate
while keeping the interprocessor communication localized.

3.1.  FIR  Filter  Design 

FIR filters are easy to pipeline or to put in parallel to achieve very
high speed. Both pipelined and parallel forms can be easily extended to any
size. Consequently, the overall speed can also be easily increased to any value.
We will show in this section how the speed of an FIR filter can be increased by
utilizing concurrency. The parallel realization of IIR filters will be discussed
in the following sections.


3.1.1.  Formulation 

Since the feedforward computation of the three matrices B, C and D in the
block state space design is very similar to the FIR filter operation, knowledge
about the FIR filter implementation would help us design the block state filters.
As mentioned before, the implementation of an FIR filter is equivalent to the
realization of a finite convolution of two sequences. The finite convolution is a
scalar operation, which can be represented by the following equation

    y_n = \sum_{k=0}^{M} a_k x_{n-k}        (3-1)

This is equivalent to setting N to zero in equation (2-1).

As mentioned earlier, the FIR filters can also be represented as a MIMO
system which has the following equation

    Y_n = A X_n + B X_{n-1}        (3-2)

where A and B are equivalent to the matrices A_0 and A_1 in (2-17), respectively.
The input and output vectors, X_n and Y_n, are defined as in (2-3). Equation (3-2)
is true only if the block length L is not smaller than the filter order M.
Otherwise, more matrix-vector products will be summed up in (3-2). Actually,
when L reduces to 1, equation (3-2) becomes a scalar equation as in (3-1).

Combining X_{n-1} and X_n into one large vector, (3-2) can be transformed into
an equation which consists of a single matrix-vector multiplication

    Y_n = [ B   A ] \begin{bmatrix} X_{n-1} \\ X_n \end{bmatrix}        (3-3)

Since the left L-M columns of matrix B are all null vectors (see A_1 in (2-17)),
the composite matrix can be reduced to another matrix A' of size L by L+M, and
only the last M samples in X_{n-1} are needed. If we define the new composite
vector as X'_n, (3-3) can be written as

    Y_n = A' X'_n

From the definition of matrices B and A, it is clear that matrix A' is a banded
matrix. This matrix is shown in (3-4) for the case of M=3 and L=4.

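    \begin{bmatrix} y_{4n} \\ y_{4n+1} \\ y_{4n+2} \\ y_{4n+3} \end{bmatrix} =
    \begin{bmatrix}
    a_3 & a_2 & a_1 & a_0 & 0 & 0 & 0 \\
    0 & a_3 & a_2 & a_1 & a_0 & 0 & 0 \\
    0 & 0 & a_3 & a_2 & a_1 & a_0 & 0 \\
    0 & 0 & 0 & a_3 & a_2 & a_1 & a_0
    \end{bmatrix}
    \begin{bmatrix} x_{4n-3} \\ x_{4n-2} \\ x_{4n-1} \\ x_{4n} \\ x_{4n+1} \\ x_{4n+2} \\ x_{4n+3} \end{bmatrix}        (3-4)

Since the first three entries of X'_n come from X_{n-1} and the others come from
X_n, it is obvious that the first three samples are equal to the last three
samples in the previous cycle.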
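
The banded formulation can be illustrated with a small example. The sketch below
assumes hypothetical coefficients a_0..a_3 (so M = 3) and L = 4; the function
name `banded_matrix` and the test input are made up for the illustration. It
builds the L by L+M matrix, carries the last M samples of each block over to the
next, and compares the block output with a direct convolution.

```python
def banded_matrix(a, L):
    """Build the L x (L+M) banded matrix; row i holds a_M ... a_0 starting at column i."""
    M = len(a) - 1
    rows = []
    for i in range(L):
        row = [0.0] * (L + M)
        for k in range(M + 1):
            row[i + M - k] = a[k]
        rows.append(row)
    return rows


a = [1.0, 2.0, 3.0, 4.0]     # hypothetical a_0 .. a_3, so M = 3
L = 4
M = len(a) - 1
A_prime = banded_matrix(a, L)

x = [float(t + 1) for t in range(12)]

# Reference: direct evaluation of the finite convolution (zero initial conditions).
y_direct = [sum(a[k] * x[n - k] for k in range(M + 1) if n - k >= 0)
            for n in range(len(x))]

# Block processing: one matrix-vector product per block of L samples.
y_block = []
tail = [0.0] * M             # last M samples of the previous block
for n0 in range(0, len(x), L):
    x_comp = tail + x[n0:n0 + L]            # composite vector of length L + M
    y_block += [sum(r * v for r, v in zip(row, x_comp)) for row in A_prime]
    tail = x_comp[-M:]                       # reused at the start of the next block

print(A_prime[0])
print(y_block == y_direct)
```

Each row of the matrix is an independent inner product, which is what allows the
rows to be computed concurrently.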

3.1.2.  Implementations 

The direct form implementation of equation (3-1), which consists of several
delay lines, multipliers and a large adder, is shown in Figure 3-1. The
throughput of this structure is fixed for a given chip speed. The only way to
increase the throughput, as suggested in Figure 1-2, is to duplicate this section
so that a number of sections can simultaneously work on successive output
samples. The accompanying problem is the extremely complex data flow pattern
from the data source to all the sections. An even more serious problem would be
the inefficiency of the hardware usage. With a single section, the output
sequence can keep flowing out of the structure by utilizing the stored data on
the delay lines. Thus, each input sample is used M+1 times, while staying on the
delay line, to generate M+1 different output samples. If more sections are
duplicated, the time gap between two adjacent output samples from each section is
longer. Hence, each input sample is used less often. This inefficiency implies
that more internal memory space is required and each input sample has to be
duplicated for all the affected sections.

Kung [5] devised several systolic arrays for implementing a finite
convolution, which can be used to realize an FIR filter. One of these arrays is
drawn in Figure 3-2.


Figure 3-2 Linear Systolic Array for the Finite Convolution

The coefficients stay in the cells, whereas the input and output sequences
travel in opposite directions at the same speed. Consecutive x_i's and y_i's are
separated by two clock cycles to ensure that each x_i is able to meet every y_i.
Obviously, this array is capable of outputting a y_i every two cycles; hence, only
one-half of the cells work at any given time. This structure, although very
simple, also suffers from the problems of limited speed and inefficiency if more
sections are duplicated.

In contrast to SISO processing, block processing can easily increase the
throughput rate by increasing the size of vector X_n. Although Kung's linear
array for matrix-vector multiplication is more efficient for a banded matrix,
the limitation on speed is still a problem. If more such arrays are put in
parallel, the data distribution is still very complex. A very efficient structure
for implementing (3-2) or (3-3) will be shown in this section. This structure has
a very regular interconnection, and hence it can be easily expanded without
complicating the communication environment. Furthermore, all the cells can still
be efficiently exploited after the structure is expanded. This structure utilizes
the idea of the systolic array but with a little modification. In order to
process the input vectors continuously, a two-dimensional array rather than a
linear one is used.

Figure 3-3a shows the basic function of one cell, and 3-3b shows the internal
structure of such a cell. The two latches in Figure 3-3b ensure that all the data
transfers can be done synchronously. The interconnection of such cells to
implement (3-4) is illustrated in Figure 3-4. The two-dimensional array in
Figure 3-4 accepts input vectors from the top and outputs vectors to its right.
The cells on the ith row calculate the ith output sample in the vector Y_n. On the
vertical transmission, no computation is involved; hence, a short bus can be used
to send data to all the cells on the same column. After finishing its job, each
cell sends its updated output to the cell on its right. The rightmost cell sends
out the final result calculated by the row in which this cell is located.

In Figure 3-4, each row is equivalent to the realization of an SISO FIR filter.
All the rows work simultaneously at the same speed but with a fixed time skew
on the starting time. If the two latches are controlled by a single clock, this
time skew equals the execution time of one cell. Thus, the number of rows is
equivalent to the number of FIR filters working independently. Obviously, the
overall throughput rate depends on the number of rows in the array. If the data
transmission overhead is neglected, the throughput rate is L times higher than
that of a single FIR filter for a vector of size L. Therefore, the throughput
rate can be easily increased by simply appending more rows to this array. There
is no limit on the number of rows that can be appended; hence, this structure can
achieve any arbitrary speed with a simple expansion. The only thing that has to
be taken care of is the data distributor, which distributes the input data
sequence to the appropriate column. Since some samples are used in two adjacent
vectors at different clock cycles, care must be taken in the data distributor
design to reflect the actual data flow. However, this data distributor is much
simpler than that in Figure 1-2, because each input sample goes through a fixed
path, whereas in Figure 1-2 each sample has to be sent to M+1 different sections
for the computation of M+1 output samples.

If a bus is used for the vertical transmission on each column, the starting
times of two adjacent rows are separated by the input sampling period rather
than by the execution time of each cell. The latter is usually much longer than
the former. If this array functions synchronously, a latch must be inserted
between any two adjacent cells on the same column as well as at the places
corresponding to the upper right zeros in matrix A'. The output samples will
then be skewed by the execution time of one cell. Thus, although the
throughput stays unchanged, the response time, or filter latency, increases.

3.2. Real Time Constraint for the IIR Filter Design

Conventional  wisdom  would  say  that  unlike  the  FIR  filter,  for  a  given  speed 
of  hardware  the  sampling  rate  of  the  IIR  filter  is  limited  by  the  feedback 
inherent  in  the  recursion.  It  will  now  be  shown  that  this  is  not  the  case,  and 
specifically  that  an  architecture  with  unlimited  speed  (similar  to  the  FIR  filter) 
can  be  defined.  This  architecture  of  course  exploits  parallelism,  and  can  be 
implemented  in  a  fashion  which  requires  only  local  communication.  The 
representation  we  will  describe  is  based  on  a  block  state  realization  of  the  IIR 
filter. 

Block filters, whether I/O or state space, consist of several matrix-vector
multiplications. From Figures 2-8 and 2-9, it is clear that among these
multiplications, only one term in each graph operates recursively. Since this
feedback operation cannot be pipelined and has to finish in one block period, the
overall throughput rate depends solely on the implementation of this feedback
term. The block period is defined as the time needed to accumulate all the
samples in the buffer. Hence, this period is L T_s, where L is the buffer size and
T_s is the scalar input sampling period. As for the other matrices, the only
constraint is that the operation time for each input vector cannot exceed the
block period. An arbitrary number of delays can be inserted on the path computing
these feedforward matrices without affecting the overall throughput rate. Hence,
the parallel implementation of the feedback matrix will be discussed first and
the implementation of the other matrices will be treated afterwards.

For a minimal realization of a block state filter, the size of the feedback
state matrix equals the filter order N, which is fixed for a given filter
independent of the block size L. Certainly, the value of each entry does change
with the block size; however, the number of entries stays the same. Therefore,
the execution time of this matrix depends only on the filter order and the
processor speed but not on the block size, if a single processor is used for
computing this feedback matrix. Suppose the times to perform a multiplication and
an addition are t_m and t_a, respectively, and the input sampling period is T_s.
The execution time of a full state matrix is N^2 (t_m + t_a), where N is the order
of the filter. The minimum block size that guarantees real time operation is the
smallest integer which satisfies the following condition

    L T_s \ge N^2 (t_m + t_a)        (3-5)

Since this execution time does not depend on L, the vector throughput rate, which
is defined as the processing speed of one vector, is fixed. The scalar throughput
rate, which is L times the vector throughput rate, can be easily increased to any
value by increasing the block size.


Although any desired speed can be achieved according to equation (3-5),
a large value of L might restrict the applicability of this filter structure,
since a large L implies a long filter latency as well as a higher computational
rate. A higher computational rate requires more hardware, and the amount of
hardware may become unrealistic when L is very large. Furthermore, no real time
signal processing system can tolerate infinite delay. Hence, if the overall
latency is intolerable, higher speed chips are required to reduce the block size
and hence the filter latency. Since L is directly related to the filter latency,
it is always beneficial to reduce this number while keeping the original speed.

For the block diagonal state matrix case, each submatrix on the diagonal is
a 2x2 matrix and is decoupled from all the other submatrices. Therefore, all the
submatrices can be processed in parallel without communicating with one
another. Hence, the execution time of the state matrix reduces to that of a 2x2
matrix. This execution time, which depends only on the chip speed but not on
the filter order, equals $4(t_m+t_a)$. Equation (3-5) can then be modified as follows

$$L\,T_s \ge 4 (t_m + t_a) \qquad (3\text{-}6)$$

For a high order filter, the value of L can be greatly reduced compared to
equation (3-5). For a filter with complex poles only, all the submatrices have
exactly the same execution time; hence, no idle time is present in any cell,
similar to the pipelining of second order filters. The overall speed of a high
order filter then depends on that of a second order section and not on the
filter order.
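The two real-time conditions can be compared numerically. The following is a minimal sketch, assuming a single processor evaluates the full state matrix (condition (3-5)) and one processor per 2x2 submatrix in the block diagonal case (condition (3-6)). The function name and the timing values are illustrative assumptions, not figures from this report.

```python
import math


def min_block_size(t_m, t_a, T_s, order, block_diagonal=False):
    """Smallest integer L such that L * T_s covers the feedback computation.

    t_m, t_a : time for one multiplication / one addition
    T_s      : scalar input sampling period (same time unit)
    order    : filter order N
    """
    # Full N x N state matrix on one processor: N^2 multiply-add pairs (3-5).
    # Block diagonal matrix: every 2x2 submatrix runs in parallel, so only
    # 4 multiply-add pairs lie on the critical path (3-6).
    pairs = 4 if block_diagonal else order * order
    feedback_time = pairs * (t_m + t_a)
    return max(1, math.ceil(feedback_time / T_s))


# Hypothetical timings in nanoseconds: 100 ns multiply, 50 ns add,
# 50 ns sampling period, sixth order filter.
print(min_block_size(100, 50, 50, 6))                       # full matrix
print(min_block_size(100, 50, 50, 6, block_diagonal=True))  # block diagonal
```

With these assumed numbers the block diagonal form needs a block size nine times smaller than the full matrix, which is the latency advantage argued above.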


In the following sections, we will discuss the implementation of a block filter
with block diagonal state matrix only. Even though the block size and the
implementation of the state matrix are fixed, good implementations of the other
three matrices B, C and D deserve further investigation. We will show how an
implementation with only localized communication can be achieved.

3.3.  Direct  Implementation 

A naive way to implement the three matrix-vector multiplications is to treat
each of them as several inner products and then transmit the outcomes to the
appropriate places. Hence, a tree-structured inner product network as shown in
Figure 3-5 is a natural selection. Each cell in this structure contains either a
multiplication or an addition. Since the execution time of each cell is much
shorter than that of one submatrix in A, several cells can be combined into one
processor. However, it is very unlikely that we can have an assignment such
that all the cells, including those working on the feedback matrix, have exactly
the same execution time. Hence, waiting time for some cells is unavoidable, and
this results in requiring more hardware.

Another problem arising from this structure is the complexity of
interprocessor communication. First, the input samples $x_j$ are sent to all rows
of matrices B and D. This requires a broadcasting type of communication link
between the data source and the matrices operating on the input sequence.
Second, the output data from the state matrix has to be sent to matrix C. If C
is implemented as inner products, each output sample from matrix A is
sent to every inner product structure. This also results in a broadcasting type
of transmission.

This rather complex communication pattern usually requires some dedicated
chips to perform the data switching. These extra switching processors, along
with the inevitable processor idle time, make this direct implementation
very undesirable. Furthermore, the complex switching network might cause a
serious delay for each message transmission. Although this delay does not
affect the overall throughput rate, since the delay is on the feedforward path,
it would surely increase the overall filter latency.

Figure 3-5 Tree Structure for Realizing Inner Product

Figure 3-6 shows the interconnection of PE's for a sixth order filter with
simple complex poles only. Cells 1 to 6 calculate $BX_n$, with each row of matrix B
assigned to one cell. Cells 7, 8 and 9 perform the state feedback operation, with
each one containing a 2x2 matrix. The remaining cells perform the calculation of
the composite matrices of C and D. In this specific example, the first row of the
matrix contains three 1's and three 0's, and hence no multiplications are
involved. Cell 10 works on the first output sample and the other cells work on
the other two output samples in two cascading stages. Figure 3-7 shows the
implementation of the same filter with block size 7 in the I/O structure. The
block size has to be at least 6 so as to have only three multiplying terms. The
cells in the first column calculate the feedforward operations. The odd numbered
cells calculate $B_0^{-1}A_0X_n$ and the even numbered cells calculate
$B_0^{-1}A_1X_{n-1}$. Each cell performs an inner product of the product matrices.
The cells in the second column calculate the feedback operation
$B_0^{-1}B_1Y_{n-1}$. Notice that the first cell in this column does not send its
output to the other cells for the calculation in the next cycle, since only the
lower six cells operate recursively. Notice the extremely complex communication
links of the feedback matrices in both cases. We expect a serious effect on the
overall speed resulting from these complex communication links, and will
discuss it later.

3.4.  Systolic  Arrays 

As mentioned in section 2-3, systolic arrays make localized communication
possible while attaining high throughput. If we can utilize the systolic idea for
realizing the operations of matrices B, C and D, the overall structure will be
regular and have only localized communication. Since global communication is
very expensive in VLSI whether it is on chip or off chip, the localized
communication makes the systolic array very attractive in VLSI applications.

Figure 3-6 A 6th Order Block State Filter Structure

To implement the matrix-vector multiplication, Kung[S] showed a linear
systolic array which can be applied to the implementation of matrices B, C and
D. However, he assumed that this multiplication is done only once, while in
digital filter implementation, the same matrix multiplies essentially an infinite
number of vectors. Unfortunately, it is not easy to feed vectors continuously
into Kung's linear array without complicating the cell's function. Furthermore,
the scalar throughput rate is fixed for a given cell speed. Therefore, it takes
more time to generate a long vector than a short vector. Hence, direct
application of this linear array cannot meet the real time requirement and some
modification has to be made.

One way to modify this architecture is to have several linear arrays working
on successive input blocks in parallel. However, the data flow pattern will be
very complicated; hence, this will destroy the simplicity of the systolic array,
which is the vital part of the systolic approach. Another method of modification
is to exploit Kung's hexagonal structure for matrix-matrix multiplication.
Several input vectors can be fed into this array at the same time as a single
matrix. However, this structure is designed for performing the matrix
multiplication only once, as is common for most structures in the computer
science area, and hence the continuous usage of this structure for a large
number of input vectors is not considered.

3.4.1.  Implementation  of  Block  State  Filters

3.4.1.1.  Matrix-Vector  Multiplication

A very efficient structure for realizing the block state filter can be
constructed by interconnecting cells which have an internal structure as shown in
Figure 3-8. Kung's hexagonal two dimensional array looks unnatural for a
digital filter implementation, since both coefficients and input data travel
through this array. In digital filter design, however, the same set of
coefficients is going to multiply a large number of input samples; hence, it is
more reasonable to keep the coefficients fixed in cells, thereby eliminating an
unnecessary (and expensive) communication.


Before showing how to realize a block state filter, we will demonstrate how
these cells can be applied to multiplying a 3x4 matrix by vectors. This
multiplication at time n can be expressed as in equation (3-7)

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \end{bmatrix} \begin{bmatrix} X_{n,1} \\ X_{n,2} \\ X_{n,3} \\ X_{n,4} \end{bmatrix} = \begin{bmatrix} Y_{n,1} \\ Y_{n,2} \\ Y_{n,3} \end{bmatrix} \qquad (3\text{-}7)$$

where $X_{n,i}$ is the $i$th component of vector $X_n$. Figure 3-8 shows the structure
of an array to implement this operation. The coefficient stored in each cell is
the value in the corresponding position of the matrix.


At time n, the input vector $X_n$ enters this array from the top with each
component aligned with its corresponding column. Each element in this input
vector is actually skewed in time, and this time skew also happens to the output
samples. A commutator model at this input is appropriate to represent the
input data flow. The speed of this commutator is synchronized to the input
sampling rate, and the same commutator can be used at the output to convert the
output from vectors to scalars. Latches are inserted on some input paths to
compensate for the time skew among the cells in the first row of the array.
Except for the cells in the bottom row, each cell transmits its upper input to
the cell below directly without any operation. It also updates its left input
according to the function shown in Fig. 3-8a, and then sends the updated value
to the cell on its right. Hence, each cell, in addition to transmitting input
data to its lower neighbor, performs one step of updating of the inner product
between the input vector and its corresponding row vector in the matrix. After
the cells in the rightmost column finish their updating, their right outputs
constitute the final output vector $Y_n$.

The basic principle behind the use of this structure is that when the upper
left cell finishes its computation on the $n$th input vector and sends the data to
its two neighbors, this cell can work on the $(n+1)$th vector. In the meantime, its
two neighboring cells can start computing the $n$th vector. Thus, the computation
of $X_n$ forms a "wavefront" propagating through the array from upper
left to lower right. When the wavefront reaches the right end, we can obtain the
final output vector. This phenomenon looks very similar to S. Y. Kung's
"Wavefront Array Processor"[24], except that the data transmission is
unidirectional and only the data propagate through the array while the
coefficients stay in the cells. Although the three components of the output
vector $Y_n$ arrive at the output ports at different times, the average time to
generate one output vector is the execution time of one cell. This is true
because the next vector, $X_{n+1}$, is on the next wavefront, which lags the
previous wavefront by the execution time of one cell. For the vertical
transmission of the input data, the time can be greatly reduced, since the
operation is simply transmitting data downward and no computation is required.
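The wavefront behaviour can be checked with a small clocked simulation. The sketch below is an illustration under stated assumptions rather than the report's hardware: every cell is modeled with one output register in each direction, so both the downward data transfer and the rightward partial-sum update take one tick, and the input latches are modeled by delaying column j by j ticks. The function name `systolic_matvec` is hypothetical.

```python
def systolic_matvec(a, vectors):
    """Clocked model of a rows x cols array with coefficients fixed in cells.

    a[i][j] is stored in cell (i, j). Inputs enter from the top, partial
    sums travel left to right. Returns one output vector per input vector.
    """
    rows, cols = len(a), len(a[0])
    n_vec = len(vectors)
    x_reg = [[0.0] * cols for _ in range(rows)]  # data handed to the cell below
    y_reg = [[0.0] * cols for _ in range(rows)]  # partial sum handed to the right
    out = [[None] * rows for _ in range(n_vec)]
    for t in range(n_vec + rows + cols - 2):
        new_x = [[0.0] * cols for _ in range(rows)]
        new_y = [[0.0] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                if i == 0:
                    n = t - j  # input latches skew column j by j ticks
                    x_in = vectors[n][j] if 0 <= n < n_vec else 0.0
                else:
                    x_in = x_reg[i - 1][j]
                y_in = y_reg[i][j - 1] if j > 0 else 0.0  # zero initialization
                new_x[i][j] = x_in                         # pass data downward
                new_y[i][j] = y_in + a[i][j] * x_in        # one multiply-add
        x_reg, y_reg = new_x, new_y
        for i in range(rows):
            n = t - i - (cols - 1)  # vector whose row-i result is now complete
            if 0 <= n < n_vec:
                out[n][i] = y_reg[i][cols - 1]
    return out


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
xs = [[1, 0, 0, 0], [1, 1, 1, 1], [2, -1, 0, 3]]
print(systolic_matvec(a, xs))
```

In this model three vectors finish after 3 + 3 + 4 - 2 = 8 ticks, and each additional vector costs one more tick, matching the statement that the average time per output vector is the execution time of one cell.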

3.4.1.2.  Filter  Implementation

For  the  matrix-vector  multiplication  structure  described  in  the  previous 
subsection,  the  vector  throughput  rate  is  fixed  and  depends  solely  on  the  cell 
speed.  This  feature  meets  the  requirement  of  a  real  time  block  filter  and  hence 
is  very  helpful  in  realizing  the  filter. 

Figure 3-9 shows the block diagram of a block state filter. The small
squares labeled A represent the computation of the feedback state matrix. The
three big rectangles B, C and D perform the matrix-vector multiplications. The
structure of these three rectangles is similar to that in Figure 3-8, but with
different sizes. Matrices B and D are initialized to zero at their y inputs, and
matrix C is initialized by the output from matrix D.

Figure 3-9 Two Dimensional Systolic Array for Block State Filter Realization

In matrix B, the input vectors are transmitted vertically and their output
samples are propagated horizontally to the right. These output samples are used
by matrix A to calculate the next state vector $R_{n+1}$, which in turn will be
used by matrix C to calculate the final outputs. As for matrices C and D, the
input samples $R_n$ and $X_n$ are transmitted horizontally, and the outputs are
propagated vertically to the bottom.
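The data flow between the four matrices can be summarized functionally, ignoring the time skew. The following is a minimal sketch of one block period, assuming the state equations $R_{n+1} = AR_n + BX_n$ and $Y_n = CR_n + DX_n$ used elsewhere in this chapter. The helper names and the first order example matrices are illustrative assumptions.

```python
def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(c * x for c, x in zip(row, v)) for row in m]


def block_state_step(A, B, C, D, R, X):
    """One block period: returns (R_next, Y) for state R and input block X."""
    BX = matvec(B, X)   # rectangle B: feedforward, may be pipelined freely
    AR = matvec(A, R)   # squares A: the only recursive term
    R_next = [p + q for p, q in zip(AR, BX)]
    DX = matvec(D, X)   # rectangle D: its outputs initialize rectangle C
    CR = matvec(C, R)   # rectangle C: driven by the current state
    Y = [p + q for p, q in zip(CR, DX)]
    return R_next, Y


# Hypothetical first order filter r[k+1] = 0.5 r[k] + x[k], y[k] = r[k],
# written by hand in block form with block size 2.
A = [[0.25]]
B = [[0.5, 1.0]]
C = [[1.0], [0.5]]
D = [[0.0, 0.0], [1.0, 0.0]]

R = [0.0]
for X in ([1.0, 0.0], [0.0, 0.0]):   # unit impulse split into two blocks
    R, Y = block_state_step(A, B, C, D, R, X)
    print(Y)
```

The two printed blocks reproduce the scalar impulse response 0, 1, 0.5, 0.25 of the assumed first order filter.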

If four cells in Figure 3-8 are combined into one PE, the new structure can
have the same form as before and the execution time of each PE will be exactly
the same as that of the PE's operating on the feedback matrix. Hence, if the
same PE is used for both the feedback and feedforward matrices, the real time
condition will be automatically satisfied. Because the cells are equally loaded
computationally, the input data can be kept pumping through the whole structure
and no idle time is necessary anywhere in this structure.

If the array works synchronously, the output samples of two adjacent rows
will be separated by the execution time of one cell. Since all the cells that work
on matrix A operate independently, this time skew on the outputs from matrix B
will not affect their function. Thus, the output samples from matrix A are also
skewed by the same time interval. This time skew matches perfectly to the
operations in matrix C, since the starting time of the cells in the same column
is also skewed by this time. This perfect match also applies to the interface
between the output from D and the input of C.

If a vertical bus is used for transmitting the input sample on each column of
matrix B, all the output samples in the same vector can be generated at the
same time. However, the time skew of matrix C between cells in each column is
still necessary. Therefore, a different number of buffers is required on each
row to store the output samples from A to reflect this time skew. On the other
hand, if a bus is used on each row of both matrix C and matrix D, the
propagation wavefront will be horizontal and move vertically. Hence, the
interface between C and D is still perfectly matched and output samples in the
same vector can be generated at the same time.

As mentioned before, the throughput rate can be increased by simply
increasing the block size. This increased block size results in larger rectangles
and more cells. However, the speed of each cell does not have to be faster. For
matrix A, neither the number nor the speed of cells has to change to cope with
the higher speed. Thus, we can effectively trade hardware for speed without
difficulty, and real time processing is possible even with low speed processors,
if using more hardware is not a problem.

3.5.  Some  Applications 

Besides recursive filtering, the efficient structure of block state filters can
also be applied to some other areas. In this section, the applications to
computer graphics and to decimation and interpolation are presented. The
rotation, scaling and translation of computer graphs can be represented by a 4x4
matrix multiplying the input vectors. An input vector is composed of the three
coordinates of a point in the graph plus an extra parameter. Hence, the two
dimensional systolic array for implementing a matrix-vector multiplication can
be applied to computer graphics. Multistage FIR filters are considered
efficient structures to realize decimation as well as interpolation. This is
true due to the fact that FIR filters are feedback free, and hence decimating the
output samples is easy. IIR filters, on the other hand, are considered inferior to
FIR filters due to their feedback operations. However, in the block state filters,
the state variables rather than the output samples are fed back for the
calculation in the next cycle. Hence, the direct sampling rate reduction on the
output samples is possible. On the other hand, the computation of all the state
variables is necessary.

3.5.1.  Computer  Graphics 

Interactive computer graphics is becoming more and more important in fields
like computer aided design, man/machine communication and character
recognition. In some sense, computer graphics is very similar to digital signal
processing systems. Both have a large number of input samples to be processed
and both require a high throughput rate. The two-dimensional systolic array
architecture described earlier can also be applied to computer graphics in some
applications, such as graph rotation, scaling and translation.

A three-dimensional (3-D) graph is a collection of many points with three
coordinates. Hence, each point can be represented by three numbers (x,y,z).
Rotation and scaling of a graph can be represented by a 3x3 matrix multiplying
all the points in the graph[25]. Scaling a graph by a constant S is simply
multiplying every component by this constant. The coordinates after the scaling
(x',y',z') can be related to the coordinates before scaling (x,y,z) by

$$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} S & 0 & 0 \\ 0 & S & 0 \\ 0 & 0 & S \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix}$$

The rotation of a graph in three dimensions can be decomposed into a
combination of three rotations about the x-, y- and z-axis respectively. Rotating
by an angle $\theta$ about the z-axis can be represented as

$$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix}$$

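The matrices representing the rotation about the x-axis and the y-axis can be
written as

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix} \qquad \text{and} \qquad \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}$$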


The  complete  rotation  can  be  represented  by  a  single  3x3  matrix  which  is  the 
product  of  the  above  three  matrices  with  proper  angles.  Hence  the  combination 
of  scaling  and  rotation  can  be  modeled  by  a  full  3x3  matrix  in  general. 

Translation of an object in the three dimensional space by $D=(D_x,D_y,D_z)$ can
be done by adding the three components of D to the corresponding components
of each point of this object. Hence, the position (x',y',z') after translation can
be represented as

$$x' = x + D_x \qquad y' = y + D_y \qquad z' = z + D_z$$

Unfortunately, this operation cannot be represented by a 3x3 matrix multiplying
a vector as in the rotation and scaling cases. However, by introducing after z
an extra coordinate, which is always 1, the translation can be represented as

$$\begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & D_x \\ 0 & 1 & 0 & D_y \\ 0 & 0 & 1 & D_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}$$

By adding one more column and row to the matrix, the rotation and scaling can
also be represented in the same dimensionality as in the translation case. A
general matrix representation of a combination of any number of rotations and
scalings can be written as

$$\begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

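If any number of translations is also included, the overall operation can be
represented as the following matrix

$$\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}$$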
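As an illustration of how the individual transformations combine into a single 4x4 matrix, the following minimal sketch builds a z-axis rotation, a uniform scaling and a translation in homogeneous coordinates and multiplies them together. The function names and the chosen angle, scale and offset are illustrative assumptions, not values from this report.

```python
import math


def matmul4(p, q):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rotation_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


def scaling(s):
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]


def translation(dx, dy, dz):
    return [[1, 0, 0, dx], [0, 1, 0, dy], [0, 0, 1, dz], [0, 0, 0, 1]]


def transform(m, point):
    """Apply a 4x4 matrix to (x, y, z) using the extra coordinate 1."""
    h = [point[0], point[1], point[2], 1.0]
    out = [sum(m[i][k] * h[k] for k in range(4)) for i in range(4)]
    return out[0], out[1], out[2]


# Rotate by 90 degrees about z, scale by 2, then translate by (1, 0, 0).
# The rightmost factor is applied to the point first.
M = matmul4(translation(1, 0, 0), matmul4(scaling(2), rotation_z(math.pi / 2)))
print(M[3])                          # last row stays 0 0 0 1
print(transform(M, (1.0, 0.0, 0.0)))
```

The point (1, 0, 0) is carried to approximately (1, 2, 0), and the last row of the combined matrix remains 0 0 0 1, which is the property exploited in the array reduction discussed next.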


A highly pipelined structure can be obtained by using a 4x4 systolic array
as shown in Figure 3-8. Furthermore, since the last coordinate is always 1 for
each point, the same function can be realized with a 3x4 systolic array. Also,
because the last entry of each row is always multiplied by unity, the structure
can be reduced further to an array of size 3x3 with the three entries in the last
column as the initialization values entering this array from the left. The
structure is shown in Figure 3-10. The throughput rate of this structure again is
governed by the time for one cell to perform one addition and one
multiplication.
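The reduction from a 4x4 to a 3x3 array can be verified with a few lines of arithmetic. The sketch below is a functional model only (no timing): each row accumulator starts from the translation entry, as the initialization value entering from the left would, and then performs three multiply-add steps. The function names and test values are illustrative assumptions.

```python
def affine_3x3(r, t, point):
    """3x3 array model: row i starts at t[i] and accumulates r[i][j] * point[j]."""
    out = []
    for i in range(3):
        acc = t[i]                      # initialization value from the left
        for j in range(3):
            acc += r[i][j] * point[j]   # one multiply-add per cell
        out.append(acc)
    return out


def homogeneous_4x4(r, t, point):
    """Reference: full 4x4 matrix applied to (x, y, z, 1)."""
    m = [list(r[i]) + [t[i]] for i in range(3)] + [[0, 0, 0, 1]]
    h = list(point) + [1]
    return [sum(m[i][k] * h[k] for k in range(4)) for i in range(4)][:3]


r = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
t = [10, 20, 30]
print(affine_3x3(r, t, (1, 1, 1)))
print(homogeneous_4x4(r, t, (1, 1, 1)))
```

Both calls give the same coordinates, while the reduced form uses 9 multiplications per point instead of 16.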

The difference between this application and the IIR filter is that the entries
will have different values for different transformations. Therefore, an efficient
way to adapt these coefficients is essential. Furthermore, multiplying matrices
together for a combination of rotation, scaling and translation requires
matrix-matrix multiplications. Hence, another efficient structure for
implementing this operation is also essential. Thus, some further research in
this area should be done before the systolic array can be applied to the
transformation of graphs.

3.5.2.  Decimation  and  Interpolation 

Decimation  and  interpolation  are  linear  and  periodic  time  varying  systems. 
The  period  depends  on  the  decimation  or  the  interpolation  ratio.  This  is  true 
because  almost  all  signal  processing  systems  are  modeled  as  scalar  input  scalar 
output  systems.  Systems  with  different  sampling  rates  at  the  input  and  output 
cannot  be  represented  as  Linear  Shift  Invariant  (LSI)  SISO  system  functions. 
However,  they  can  be  represented  as  LSI  MIMO  systems  with  a  proper  ratio 
between  the  input  and  the  output  block  sizes[26, 27]. 

Sampling rate reduction, or decimation, can be realized in two stages.
First, pass the input sequence through a low pass filter with bandwidth $\pi/M$.
Then, throw away M-1 out of M samples, retaining the Mth, where M is the
decimation ratio. The low pass filter is used to prevent aliasing resulting from
the lower sampling rate. Interpolation, on the other hand, can be realized by
inserting I-1 zeros between each pair of adjacent samples and then passing this
new sequence through a lowpass filter with bandwidth $\pi/I$, where I is the
interpolation ratio. The lowpass filter is used to remove the spurious components
generated from the interpolation process. The detailed design procedure can be
found in [23].
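The two-stage procedures can be sketched as follows. This is a minimal illustration in which the lowpass filter is passed in as a function; the two-tap averaging FIR used in the example is only a stand-in and is not a properly designed anti-aliasing filter. All function names are illustrative assumptions.

```python
def fir_filter(h, x):
    """Direct-form FIR with zero initial conditions: y[k] = sum_i h[i] x[k-i]."""
    return [sum(h[i] * x[k - i] for i in range(len(h)) if k - i >= 0)
            for k in range(len(x))]


def decimate(x, M, lowpass):
    """Filter first, then keep the M-th sample of every group of M."""
    filtered = lowpass(x)
    return filtered[M - 1::M]


def interpolate(x, I, lowpass):
    """Insert I-1 zeros after every sample, then filter."""
    stuffed = []
    for sample in x:
        stuffed.append(sample)
        stuffed.extend([0.0] * (I - 1))
    return lowpass(stuffed)


def average(s):
    """Placeholder lowpass: two-tap moving average."""
    return fir_filter([0.5, 0.5], s)


print(decimate([2.0, 4.0, 6.0, 8.0], 2, average))
print(interpolate([1.0, 2.0], 3, lambda s: s))   # identity filter shows the zeros
```

Note that `decimate` computes every filtered sample before discarding M-1 of them, which is the wasted work that the remainder of this section seeks to avoid.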


The low pass filters are usually realized with multistage FIR filters[29]. The
decimation process can be done by calculating one out of M output samples, if
an FIR filter is used. For IIR filters, on the other hand, each output depends on
the previous N samples, where N is the order of the filter. Therefore, in order to
calculate the Mth output sample, the previous samples also have to be
computed; hence, very little saving in computation can be achieved. Although IIR
filters require much less computation than their equivalent FIR filters, the
recursive operation makes them inferior to FIR filters in the decimation and
interpolation applications.


Block state filters, on the other hand, allow us to decimate the output
samples directly, since the output samples are not fed back. All the output
samples in a block are generated from the same state vector. The state vector
is fed back for the next cycle calculation. Therefore, the computation of the
state variables cannot be reduced. Even if the block state filters have a
slightly higher computational rate, the simplicity in the implementation of
their structures makes them still very attractive.


Suppose the low pass filter with bandwidth $\pi/M$ is realized in a block state
form with block size M, whose representation is rewritten in equation (3-8).

$$R_{n+1} = A R_n + B X_n$$
$$Y_n = C R_n + D X_n \qquad (3\text{-}8)$$

The sizes of the four matrices A, B, C and D are respectively NxN, NxM, MxN and
MxM, where N is the order of the filter. Since the output vector $Y_n$ is not fed
back, the output decimation can be done by throwing away M-1 samples in each
output vector, which is equivalent to calculating only the first sample in $Y_n$.
Therefore, only the first rows of matrices C and D are required to compute the
output samples. Since matrix D is lower triangular, only the leftmost entry of
the first row is non-zero. Hence, this matrix multiplication is reduced to a
scalar multiplying the first samples of the input blocks.
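The saving can be demonstrated on a small example. The sketch below assumes a scalar state model r[k+1] = a r[k] + b x[k], y[k] = c r[k] + d x[k], from which the block matrices follow as A = a^M and B = [a^(M-1) b, ..., a b, b]; the first row of C is then c and the first row of D is [d, 0, ..., 0]. The function names and the second order coefficients are illustrative assumptions.

```python
def mat_vec(m, v):
    return [sum(p * q for p, q in zip(row, v)) for row in m]


def mat_mat(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def block_matrices(a, b, M):
    """Return (A, B) of the block state form for block size M."""
    n = len(a)
    powers = [[[float(i == j) for j in range(n)] for i in range(n)]]  # a^0
    for _ in range(M):
        powers.append(mat_mat(powers[-1], a))
    cols = [mat_vec(powers[M - 1 - i], b) for i in range(M)]  # a^(M-1-i) b
    B = [[cols[i][s] for i in range(M)] for s in range(n)]
    return powers[M], B


def block_state_decimate(a, b, c, d, M, x):
    """Decimate by M using only the first rows of C and D."""
    A, B = block_matrices(a, b, M)
    R = [0.0] * len(a)
    out = []
    for start in range(0, len(x) - M + 1, M):
        X = x[start:start + M]
        # First row of C is c; first row of D has the single entry d.
        out.append(sum(ci * ri for ci, ri in zip(c, R)) + d * X[0])
        R = [p + q for p, q in zip(mat_vec(A, R), mat_vec(B, X))]
    return out


def scalar_filter(a, b, c, d, x):
    """Reference: sample-by-sample state space filter."""
    r, y = [0.0] * len(a), []
    for xk in x:
        y.append(sum(ci * ri for ci, ri in zip(c, r)) + d * xk)
        r = [p + bi * xk for p, bi in zip(mat_vec(a, r), b)]
    return y


a = [[0.5, -0.3], [0.2, 0.4]]
b, c, d = [1.0, 0.5], [0.7, -0.2], 0.1
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0, 0.25, -0.75, 1.0, 0.0]
print(block_state_decimate(a, b, c, d, 4, x))
print(scalar_filter(a, b, c, d, x)[::4])
```

The two printed sequences agree up to rounding, although the block version never computes the M-1 discarded outputs of each block.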

With the sizes of all these matrices in mind, we can calculate the number of
multiplications per input sample, which is a good indicator of the speed of this
implementation. For an Nth order filter with all complex poles, this number is

$$N + \frac{3N+1}{M} \qquad (3\text{-}9)$$

This number is compared to that of an optimal multistage FIR filter approach
developed by Crochiere et al.[30], and the result is shown in Table 3-1.


                  Decimation    Filter        Mult/Sample   Data Storage
                  Ratio         Order

One Stage         D1=100        N1=14         14.43         114
(Block State)

Two Stage         D1=20         N1=4          5.43          43
(Block State)     D2=5          N2=14

Two Stage         D1=50         N1=423        5.93          769
(FIR)             D2=2          N2=347

Three Stage       D1=10         N1=33         4.06          42S
(FIR)             D2=5          N2=33
                  D3=2          N3=35S


Table 3-1 Comparison between block state and FIR filters for Decimation


Zeman[31] mentioned that block state filters can save computation only if the
decimation ratio M is moderate. However, from Table 3-1, for a decimation ratio
of 100, the two stage block state filter is comparable to the two stage optimal
FIR filter. As for implementation considerations, block state filters are
much simpler than FIR filters and their inherent parallelism can result in very
high speed. Notice also the huge difference in the data storage requirement
between the two approaches. The multistage FIR filter approach requires a
memory space whose size is an order of magnitude higher than that of a block
state filter.
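The single stage entry of Table 3-1 can be reproduced from the matrix sizes. The sketch below assumes the operation count behind equation (3-9), N + (3N+1)/M, breaks down as follows: 2N multiplications for the block diagonal A (N/2 submatrices of four entries), NM for the full matrix B, N for the first row of C and one for the single non-zero entry in the first row of D. This breakdown is an interpretation consistent with (3-9), and the function name is hypothetical.

```python
def mults_per_input_sample(order, M):
    """Multiplications per input sample for a one-stage block state decimator."""
    feedback = 2 * order      # N/2 decoupled 2x2 blocks, 4 multiplications each
    b_matrix = order * M      # full N x M matrix B
    c_first_row = order       # only the first row of C is needed
    d_first_entry = 1         # single non-zero entry in the first row of D
    per_block = feedback + b_matrix + c_first_row + d_first_entry
    return per_block / M      # M input samples arrive per block


# One stage, decimation ratio 100, 14th order filter (first row of Table 3-1).
print(round(mults_per_input_sample(14, 100), 2))
```

For N = 14 and M = 100 this gives 14.43, in agreement with the one stage block state entry of the table.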

Another advantage of the block state filter as well as the optimal FIR filter
over many other structures is that the decimation ratio can be any integer. It
does not have to be a power of 2. Half band digital filter[32] and
wave digital filter approaches, for instance, do require this constraint.

The block state filter can also be effectively applied to interpolation.
Suppose the filter is realized with a block size I, where I is the interpolation
ratio. Since the input sequence has I-1 zeros out of every I samples, some
savings in computation can be made. Matrices B and D are the only two terms
operating on the input vectors. If the first sample is chosen to be the non-zero
one, only the first columns of B and D are required. This reduces the number of
multiplications of these two matrices from NI+(I+1)I/2 to N+I. Furthermore, this
nonzero entry in the input vectors can be any component in the vector. If the
last element is chosen, the operation of matrix D again reduces to a scalar
multiplication. The average computation then equals that of equation (3-9) with
M replaced by I.
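The input-side saving quoted above is simple to tabulate. The following minimal sketch counts the multiplications of matrices B and D for one block, assuming a full NxI matrix B and a lower triangular IxI matrix D. The function name and the example values of N and I are illustrative assumptions.

```python
def input_side_multiplications(order, I, zero_stuffed):
    """Multiplications in B and D per block of I input samples."""
    if zero_stuffed:
        # Only the first columns are used: N entries of B and I entries of D.
        return order + I
    # Full N x I matrix B plus the lower triangular I x I matrix D.
    return order * I + (I + 1) * I // 2


print(input_side_multiplications(14, 100, zero_stuffed=False))
print(input_side_multiplications(14, 100, zero_stuffed=True))
```

For a 14th order filter and an interpolation ratio of 100, the count drops from 6450 to 114 multiplications per block.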

Both  decimation  and  interpolation  can  be  represented  by  a  single  block 
filter  representation  with  a  proper  ratio  between  the  input  and  output  block 
sizes.  In  the  scalar  filter  modeling,  decimating  the  output  samples  and  inserting 
zeros  in  the  input  sequence  cannot  be  modeled  as  a  linear  time  invariant  pro¬ 
cess.  Hence,  these  processes  have  to  be  modeled  separately  from  the  filtering 
operation.  However,  as  shown  above,  it  is  straightforward  to  model  both  decima¬ 
tion  and  interpolation  by  a  single  LSI  MIMO  filter. 


3.6.  Conclusions


A filter structure which can achieve any desired speed with localized
interconnection has been presented in this chapter. This structure, modeled by
state equations in a block form, utilizes extensive pipelining and concurrency.
State equations require a fixed computational rate for the recursive operation,
even when the block size changes, since the state variables rather than the
output samples are the feedback elements. The number of states equals that of the
poles of a filter, and hence is a fixed number for a given filter. Any desired
speed can be achieved by choosing a proper block size so that the state
feedback operation can be completed in time. Buffering the input samples also
increases the parallelism, which makes pipelining and concurrency feasible.
Systolic arrays for the matrix-vector multiplications provide a very simple
interconnection which makes this structure very attractive in VLSI applications.
Furthermore, this structure can actually be applied to any linear shift invariant
system which can be represented by discrete time state equations.

The applications of this structure to computer graphics as well as to
decimation and interpolation are also presented. The regularity of the
interconnection in systolic arrays and the lower computational rate of IIR
filters compared to the equivalent FIR filters make this structure very
attractive in these two applications. However, the coefficient adaptation
problem in computer graphics must be solved before this structure can be
effectively applied. This is also true for most time varying systems which are
modeled by time varying state equations.


CHAPTER  4 


PERFORMANCE  ANALYSIS 


This chapter compares the performance of the realizations discussed in
chapters 2 and 3 in various respects. In each respect, the performance of the
best known cascade and parallel connections of second order filters will be
discussed first as a reference. The performance of the other structures will be
treated afterwards.

Since real time operation is of prime concern in most filter designs, the
limitation on the realizable sampling rate is discussed in the first section
under the assumption of no message transmission delay. The effect of
communication delay on this rate is then treated, as well as the latency of the
signal through the filter. A detailed comparison of the computational rate of
FIR and IIR filters will then follow. This will indicate that the IIR filter is
best for processing signals with low to moderately high sampling rates, and
that FIR filters are best for extremely high sampling rates. Finally, the
percentage of idle time for the PE's will be discussed to investigate the
efficiency of the hardware usage.

A  detailed  discussion  of  the  performance  of  a  block  state  filter  with  a  block 
diagonal state matrix was given in sections 3.2, 3.3 and 3.4. An even more
complete  analysis  will  be  given  in  this  chapter  and  its  performance  will  be  com¬ 
pared  to  other  structures  mentioned  in  chapter  2.  After  finishing  the  analysis 
and  comparison  in  this  chapter,  it  will  become  clear  why  this  structure  outper¬ 
forms  any  other  structure  in  almost  all  respects. 


4.1.  Speed  Limitations 

In this section, the highest achievable throughput rate for every structure
will be discussed. We will initially assume no delay for message transmission
between PE's. The effect of transmission delay will be treated in the next
section.

Although requiring more computations than SISO filters, block processing of
the input samples can in theory achieve any desired sampling rate. Real time
operation is guaranteed by choosing an appropriate block size. This size
depends on the input sampling rate and the speed of the PE's. In practice,
however, the sampling rate is limited by the filter latency and the buffer
size. The highest sampling rate for SISO filters, on the other hand, is limited
by the PE speed. If the sampling rate is higher than this limit, faster PE's
are necessary. Hence, for signals sampled at a very high rate, block processing
is a better approach.

4.1.1.  Parallel  and  Cascade  Forms 

The  throughput  of  both  the  parallel  and  cascade  forms  is,  in  general, 
bounded  by  the  throughput  of  one  second  order  filter,  no  matter  how  complex 
the original filter is. For the cascade structure shown in Figure 2-1 of section
2.2.1, the pipelined processing of the input samples gives us a high throughput rate,
which  depends  on  the  longest  processing  time  among  all  the  cascaded  stages.  If 
each  stage  is  either  a  second  or  a  first  order  filter,  the  throughput  rate  depends 
on  that  of  a  second  order  filter. 

For the parallel structure shown in Figure 2-2 of section 2.2.2, since all the
second order filters work on the input samples simultaneously, the throughput
rate also depends on the longest processing time among all sections. If the
final summing node is lumped into one of the filter sections, the longest
processing time is the processing time of a second order filter plus that of
the final summing node. If this node is cascaded with all the filter sections,
the throughput rate then depends on a second order section, provided the
summing node can be processed faster than a second order filter. The difference
between these two structures is that the cascade form has longer latency but
simple communication links, while the parallel form has several-to-one type of
communication at the final summing node. The time bound for both forms is the
time to perform five multiplications and four additions, if a direct form
implementation is used for each section.

4.1.2.  Systolic  Arrays 

Kung's systolic array[7] also uses extensive pipelining but in a different
manner from the cascade form. In this array, input and output samples
travel in opposite directions. The output samples are also fed back to this
array, and travel in the same direction as the input samples. Thus, this
structure can be viewed as three sequences pipelined through each cell at the
same speed. Due to this pipelining nature, the throughput rate depends on the
execution time of one cell, which contains two multiplications and two
additions. Since, as noted in Figure 2-4 in section 2.3, every second cell is
idle at any given time, the actual processing time per output sample is two
times that of one cell. Thus, the minimum achievable sampling period is twice
the execution time of one cell, that is, the time for only four multiplications
and four additions. Therefore, this structure can achieve a slightly higher
throughput than either the cascade or the parallel form.

The advantage of this structure over the cascade and the parallel forms is
that it can be realized directly from the filter difference equation. No
factorization or partial fraction expansion is required. This simplifies the
design procedure. Furthermore, similar to the cascade form, this structure
requires only local interconnection. It does not require any broadcasting type
of communication as in the parallel form. However, the roundoff noise behavior
is similar to that of the direct form structure, which is a serious problem
compared to the other two structures.


4.1.3.  SSIMD Mode


Although Barnwell's structure can achieve a higher sampling rate than the
three forms mentioned above, the highest rate of this structure is still
limited. The average time to generate one output sample, in this mode, equals
the time to do one multiplication and addition[9]. This time bound is about one
fifth of that of the parallel and cascade structures, if the same PE's are
used. Thus, Barnwell's algorithms can achieve a sampling rate about five times
higher than those two structures.
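The time bounds quoted so far in this section can be compared numerically. The following Python sketch is only an illustration: the function name and the operation times are assumptions, and transmission and I/O times are ignored, as in the text.

```python
def time_bounds(t_mult, t_add):
    """Minimum time per output sample for three SISO structures."""
    return {
        # one direct form second order section: 5 multiplications, 4 additions
        "cascade_parallel": 5 * t_mult + 4 * t_add,
        # two cell times per output, each cell: 2 multiplications, 2 additions
        "systolic": 2 * (2 * t_mult + 2 * t_add),
        # SSIMD mode: one multiplication and one addition
        "ssimd": t_mult + t_add,
    }


# Hypothetical operation times in arbitrary units.
bounds = time_bounds(t_mult=1, t_add=1)
print(bounds)
print(bounds["cascade_parallel"] / bounds["ssimd"])   # ratio of the two bounds
```

With equal multiplication and addition times the ratio is 4.5, and it approaches 5 as the addition time becomes negligible, which is consistent with the "about one fifth" figure above.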

To write the programs in SSIMD mode, the only information required is the
length of the program and the times at which recursive inputs are needed and
recursive outputs are available. Hence, a task with a single recursive output as
in Figure 2-5b in section 2.5 can be characterized as

    K(I(L), I(L-1), ..., I(1); R; T)

where K is the task identifier, T is the task length, R is the output time for
the recursive output, I(l) is the input time for the l-th delayed recursive
output, and L is the value of the longest delay. Some important theoretical
results for this realization are summarized as follows[9].

(1) All PE's are started at equal intervals and the outputs are periodic with
    period $t_s = T/M$, where M is the number of PE's used. This is true only
    if M is less than some number $M^*$.

(2) The upper bound in (1) is given by

        $M^* = \mathrm{INT}[M(l^*)]$                                (4-1a)

        $M(l) = \frac{l\,T}{R - I(l)}$                              (4-1b)

    where M(l) is the non-integer number of PE's which could be utilized if
    the only constraint came from the recursive input I(l), $l^*$ is the value
    of l for which M(l) is minimum, and INT[.] means "the integer part".

(3) The greatest throughput achievable is obtained with a time skew of
    $t_s = (R - I(l^*))/l^*$. This solution is generally achieved with
    $M^* + 1$ PE's if $M(l^*)$ is not an integer. Adding one more PE will
    introduce some idle time in each PE; in other words, T will increase.


Based on these results, it is easily seen that, given a signal flow graph, the
maximum number of PE's is immediately available, and the maximum throughput
rate is a function of the PE speed only.

According to the first property, the more PE's that can be used without
increasing T, the higher the throughput rate this structure can achieve.
However, it is not possible to increase the number of PE's indefinitely.
According to property 2, the sampling rate saturates when this number reaches
some value. Beyond this value, increasing the number of PE's does not help
improve the speed performance. This critical number, which equals $M^* + 1$,
can be obtained by examining the throughput while increasing the number of PE's
with the assumption that all the input samples are stored somewhere and are
ready for use. This assumption ensures that the speed limitation actually comes
from the structure itself and not from the finite input sampling rate. The
number at which the sampling rate starts saturating is the maximum number
of PE's that can be used to increase the throughput rate.

The previously summarized results in this section will be verified by
considering the limitations on a specific PE. Since the structure operates in
single instruction multiple data mode, all the other PE's will undergo exactly
the same situation. This specific PE is examined in the following two steps.
First, the starting time constraint imposed upon this PE by the availability of
the l-th delayed recursive input will be considered. This constraint is
obtained by delaying the starting time until no waiting time is necessary for
this PE. Then, this limitation will be generalized by considering all the
delayed recursive inputs. If each delayed recursive input gives rise to a
different constraint on the starting time, the latest time is chosen in the
obvious fashion.


Suppose PE 0 starts working at time 0 and receives all the inputs in time.
This PE can then transmit its recursive output at time R and the final output y
at time T. This recursive output will be used as the l-th delayed recursive
input by some other PE l. The earliest time that this l-th PE can start working
is at time R-I(l). This is clear from the graph shown in Figure 4-1. If it
starts earlier, it cannot receive the output from PE 0 in time; hence, some idle
time occurs. The reason that this receiving PE is labeled as PE l is that there
must be another l-1 PE's which use this recursive output as their first through
(l-1)-th delayed recursive inputs. These l-1 PE's can start working before this
receiving PE, obviously. Hence, the time skew between any two adjacent PE's is
(R-I(l))/l.

In the next cycle, PE 0 will start working following the last PE. The time
interval between the starting times of these two PE's is also (R-I(l))/l. Since
PE 0 will resume its function at time T, no waiting time will occur if it can
start working on or before T in the next cycle. The maximum number of PE's M(l)
that can be used without introducing idle time must satisfy the following
relation

    $M(l) \cdot \frac{R - I(l)}{l} = T$


If $M^*$ PE's are used to realize the filter, no PE has to sit idle. This
realization is called PE optimum, since a high rate filter can be realized
using the maximum possible number of PE's without introducing any idle time.
However, if $M(l^*)$ is not an integer, introducing an extra PE can achieve a
slightly higher sampling rate. This realization is called speed optimum, since
no other realization in the SSIMD mode can achieve a higher sampling rate. On
the other hand, since idle time exists, it is not PE optimum.


From Figure 4-1, the shortest possible time skew between any two adjacent
PE's is $(R - I(l^*))/l^*$. If $M^* + 1$ PE's are used, PE 0 will sit idle for
a while before it can start the computation for the next cycle. Since this
algorithm is SSIMD, this idle time occurs in every PE and hence lowers the
efficiency of the usage of the PE's. However, each PE can start working right
after it receives data from some other PE's, whereas with $M^*$ PE's, each PE
has to wait until the previous cycle finishes even if the input data is ready
for it to use. With yet another extra PE, each PE has to wait a whole cycle and
hence the throughput rate does not increase with this PE. Hence, one extra PE
lowers the PE usage efficiency, but achieves the highest possible sampling
rate. This verifies property (3).
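The PE optimum and speed optimum realizations can be explored with a short Python sketch. It assumes that the number of PE's usable under the l-th recursive input alone is M(l) = lT/(R - I(l)), as follows from the skew argument above, and that the output period with M PE's is the larger of T/M and the minimum skew. The function names and the task parameters are hypothetical.

```python
from fractions import Fraction
from math import floor


def ssimd_limits(task_length, recursive_output_time, input_times):
    """input_times maps delay l -> I(l). Returns (l*, M(l*), M*, minimum skew)."""
    m_of_l = {
        l: Fraction(l * task_length, recursive_output_time - i_l)
        for l, i_l in input_times.items()
    }
    l_star = min(m_of_l, key=m_of_l.get)          # tightest constraint
    m_star = floor(m_of_l[l_star])                # integer part of M(l*)
    min_skew = Fraction(recursive_output_time - input_times[l_star], l_star)
    return l_star, m_of_l[l_star], m_star, min_skew


def sampling_period(task_length, min_skew, num_pes):
    """Output period: T/M until the minimum skew is reached, then constant."""
    return max(Fraction(task_length, num_pes), min_skew)


def idle_time_per_cycle(task_length, min_skew, num_pes):
    """Idle time of each PE during one round of num_pes skewed starts."""
    return num_pes * sampling_period(task_length, min_skew, num_pes) - task_length


# Hypothetical task: T = 7, R = 3, I(1) = 1, I(2) = 0 (arbitrary time units).
l_star, m_l, m_star, skew = ssimd_limits(7, 3, {1: 1, 2: 0})
for pes in (m_star, m_star + 1, m_star + 2):
    print(pes, sampling_period(7, skew, pes), idle_time_per_cycle(7, skew, pes))
```

For this task M(l*) = 3.5, so three PE's run without idle time, a fourth PE shortens the period to the minimum skew at the cost of some idle time, and a fifth PE only adds idle time.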


Figure  4-2  shows  the  computation  sequence  of  one  PE  to  realize  a  third 
order  filter.  It  is  obvious  that  the  execution  time  between  any  two  delayed 
recursive  inputs  is  the  time  to  perform  one  multiplication  and  one  addition. 
However,  in  the  actual  implementation,  the  time  to  transfer  the  data  between 
the  PE’s  and  the  I/O  ports  should  also  be  included  in  this  time  skew.  Therefore, 
the  throughput  rate  is  actually  governed  by  the  time  for  fetching  one  sample 
from  the  input  port,  sending  one  sample  to  the  output  port,  and  the  time  to  per¬ 
form  one  multiplication  and  one  addition.  This  data  transfer  time  should  also  be 
added  to  the  other  two  structures  mentioned  in  sections  4.1.1  and  4.1.2. 
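The computation sequence of Figure 4-2 can be written out as a short Python sketch. It assumes that the accumulator r_n starts from the input sample x_n and that the recursive coefficients enter with a plus sign, as in the figure; the function names and the test coefficients are hypothetical. The sketch runs the tasks one after another on a single processor, so it shows only the arithmetic of one task, not the skewed multiprocessor schedule.

```python
def third_order_task(x_n, r_past, a, b):
    """One PE task. r_past = [r_{n-1}, r_{n-2}, r_{n-3}]; returns (r_n, y_n)."""
    r_n = x_n                          # assumed starting value of the accumulator
    for k in range(3):                 # three recursive multiply-add steps
        r_n = r_n + b[k] * r_past[k]
    y_n = a[0] * r_n                   # first output step is a multiplication only
    for k in range(3):                 # three feedforward multiply-add steps
        y_n = y_n + a[k + 1] * r_past[k]
    return r_n, y_n


def run_filter(samples, a, b):
    """Apply the task to successive samples, shifting the recursive values."""
    r_past = [0.0, 0.0, 0.0]
    outputs = []
    for x_n in samples:
        r_n, y_n = third_order_task(x_n, r_past, a, b)
        outputs.append(y_n)
        r_past = [r_n] + r_past[:2]
    return outputs


# Hypothetical coefficients: a = [a0, a1, a2, a3], b = [b1, b2, b3].
print(run_filter([1.0, 0.0, 0.0, 0.0], a=[1.0, 0.0, 0.0, 0.0], b=[0.5, 0.0, 0.0]))
```

Each multiply-add step in the task corresponds to one unit of the time skew discussed above.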


    r_n ← r_n + b_1 · r_{n-1}
    r_n ← r_n + b_2 · r_{n-2}
    r_n ← r_n + b_3 · r_{n-3}
    y_n ← a_0 · r_n
    y_n ← y_n + a_1 · r_{n-1}
    y_n ← y_n + a_2 · r_{n-2}
    y_n ← y_n + a_3 · r_{n-3}

Figure 4-2  Computational Sequence of one PE for the Barnwell Algorithm


4.1.4.  Block I/O Filters

Burrus mentioned in his paper[33] that the block filter is a better choice
than the direct form implementation only if the order of the filter exceeds 25.
This number was obtained by comparing the number of computations in both
structures. Since the number of operations per output sample is a good
indicator of speed performance only for serial processing systems, his
conclusion does not apply to the case of parallel processing. For parallel
processing, the sampling rate depends not only on this number but also on the
parallelism in the algorithm. The sampling rate can even be totally unrelated
to this number, as in the block state filter implementation described in
section 3.4. For multiprocessing, it is advantageous to realize the filter in
block form, even for low order filters, if using a large number of PE's is of
minor concern.

For block filters, Burrus assumed that the computation is divided into two
matrix-vector multiplications associated with $B_1 Y_n$ and $A_1 X_n$, and two
convolutions associated with the operations of $B_0^{-1}$ and $B_0^{-1} A_0$.
These two convolutions
are computed by FFT algorithms. Although the FFT can reduce the computational
rate, block processing usually increases this rate. Furthermore, although
the data flow of the FFT is pipelined in nature, the highly complex data flow
pattern between adjacent pipelined stages makes local connection very difficult;
hence, we will implement the product matrices directly instead of dividing the
computation into two stages. Therefore, the implementation of a block I/O filter
will be treated as three matrix-vector multiplications.


As  noted  in  section  3.2,  the  maximum  possible  throughput  rate  depends  on 
how the feedback product matrix $B_0^{-1}B_1$ is realized. The implementation of
this feedback matrix and the criterion for real time processing will be discussed
in this section. Subsequently, the feedforward matrices will be treated. Obviously,
the  two  dimensional  systolic  array  discussed  in  section  3.4  can  be  employed  for 
the  implementation  of  these  feedforward  matrices.  However,  the  function  of 
each  cell  depends  on  the  implementation  of  the  feedback  matrix.  Since  this 
array  uses  extensive  pipelining,  it  cannot  be  applied  to  the  feedback  operation. 

Since the number of non-zero columns of the feedback matrix is invariant
as the block size L changes, the number of operations in each row stays
unchanged. If each row is assigned to one PE, the computational rate in each PE
does not change with the block size. For an N-th order filter, it takes
$N(t_a+t_m)$ to compute this feedback operation, where $t_a$ and $t_m$ are
defined as in section 3.2. The real time condition is satisfied if the
following inequality holds

    $L \ge \frac{N(t_a + t_m)}{T_s}$                                (4-2)

For a given filter, a finite L which satisfies this inequality always exists,
no matter how large $t_m$ and $t_a$ are and how small $T_s$ is. In other words,
a real time filter is always realizable for signals sampled at a very high rate
even on very slow hardware. Note also that only the lower right N×N submatrix
is the feedback term, because all the outputs from the other rows are
multiplied by zeros in the next block period.
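As a small worked example of the real time condition (4-2), the following Python sketch computes the smallest block size for which the feedback computation of one row, taking N additions and N multiplications, fits within one block period of L sampling intervals. The function name and the timing values are hypothetical.

```python
import math


def min_block_size(order, t_add, t_mult, sample_period):
    """Smallest integer L such that L * T_s >= N * (t_a + t_m)."""
    return max(1, math.ceil(order * (t_add + t_mult) / sample_period))


# Hypothetical numbers: 4th order filter, 50 ns additions, 200 ns multiplications.
print(min_block_size(order=4, t_add=50, t_mult=200, sample_period=100))
print(min_block_size(order=4, t_add=50, t_mult=200, sample_period=30))
```

Shortening the sampling period only raises the required block size; it never makes the condition unsatisfiable.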

In the operation assignment mentioned above, more links are needed to
connect the PE's when the block size increases. Certainly many other methods
can be found for the implementation. It is possible either to divide each row
among more PE's or to group more rows into one PE. The former method
complicates the communication environment and the latter requires a larger
block size and hence longer latency. Grouping all the lower N rows into one PE
would be a good way, as far as communication complexity is concerned, if the
much longer latency is tolerable. This increased latency might severely restrict
the applicability of this algorithm.

As for the implementation of the other two feedforward product matrices,
the only constraint is that the execution time of each PE should not be larger
than $L T_s$, where L is decided by the implementation of the feedback matrix.
The structures described in section 3.4 to implement the feedforward matrices
can also be used to realize this feedforward operation. However, each PE in
this structure should contain the same amount of operation as that of the
feedback PE's so as to fully utilize the hardware. This is not an easy task,
because of the dissimilarity among the structures of the three matrices.
Furthermore, reducing the communication paths in the feedforward matrices
alone makes little sense, since the throughput suffers more from the complex
interconnection in the feedback operation than from that in the feedforward
operations.


4.1.5.  Block  State  Structures 

Within  the  context  of  block  state  filters,  we  have  presented  both  two  dimen¬ 
sional  systolic  arrays  and  block  cascade  and  parallel  approaches.  The  block  cas¬ 
cade  and  parallel  filter,  a  simple  extension  of  the  scalar  cascade  and  parallel 
forms,  is  going  to  be  studied  in  this  section  and  compared  to  the  systolic 
approach  described  in  section  3.4. 


4.1.5.1.  Systolic Arrays

As  described  in  section  3.2,  the  block  state  filter  can  achieve  any  desired 
sampling  rate  by  simply  selecting  an  appropriate  block  length.  This  is  true  for 
either a full or a block diagonal state matrix. The advantage of the block
diagonal form over the full matrix is that it requires a much shorter block
size to achieve the same sampling rate. This implies that it can usually realize
the same filter with fewer PE's while suffering less from the filter latency.
Another advantage of the block diagonal state matrix is that the filter can be
easily realized with a systolic approach, which requires localized
interconnection only.


4.1.5.2.  Block Cascade and Parallel Forms


Unlike the other block filters described in sections 2.5, 2.6 and chapter 3,
which have the property that any throughput rate can be achieved by simply
increasing the block size, Zeman and Lindgren's structures have an upper bound
on the achievable sampling rate. In other words, the real time condition cannot
always be satisfied by increasing the block size. The reason is that the
internal operation
of  a  second  order  filter  is  serial  rather  than  parallel.  The  increased  parallelism 
generated  from  buffering  the  input  samples  is  not  fully  utilized  in  these  struc¬ 
tures. 

Similar  to  the  scalar  cascade  and  parallel  forms,  the  overall  throughput 
rate  depends  on  the  throughput  of  a  second  order  filter  section.  Also  due  to 
their  serial  processing  within  each  second  order  section,  the  overall  throughput 
depends  on  the  average  speed  of  an  inner  product  arithmetic  unit  to  generate 
one  output  sample.  Apparently,  this  speed  is  inversely  proportional  to  the  aver¬ 
age  number  of  multiplications  required  to  generate  one  output  sample.  This 
number, in turn, depends on the block size L and the order of the filter N.
Considering that the D matrix is lower triangular, the total number of
multiplications for generating one output sample is

    $\frac{1}{L}\left[ 2N(1+L) + \frac{L(L+1)}{2} \right] = \frac{2N}{L} + 2N + \frac{L+1}{2}$

Differentiating this number with respect to L and setting the derivative to
zero, the optimal block length, which gives the minimum number of operations,
can be obtained. The optimal block length is

    $L_{opt} = 2\sqrt{N}$

The number of operations associated with this block length is

    $2N + 2\sqrt{N} + 0.5$                                          (4-3)

Hence, the maximum achievable sampling rate with this architecture is bounded
by the number in (4-3) and the chip speed.
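The optimum can be checked numerically with a short Python sketch. It assumes that one block costs 2N multiplications for the state matrix, NL each for the B and C matrices and L(L+1)/2 for the lower triangular D matrix; this breakdown is an assumption chosen because it reproduces the value in (4-3). The function names are hypothetical.

```python
import math


def mults_per_output(order, block):
    """Average multiplications per output sample for a given block size."""
    per_block = 2 * order + 2 * order * block + block * (block + 1) / 2
    return per_block / block


def optimal_block_length(order):
    """Stationary point of mults_per_output with respect to the block size."""
    return 2 * math.sqrt(order)


def best_integer_block(order, max_block=256):
    """Integer block size with the fewest multiplications per output sample."""
    return min(range(1, max_block + 1), key=lambda L: mults_per_output(order, L))


for order in (4, 9):
    L = best_integer_block(order)
    print(order, optimal_block_length(order), L, mults_per_output(order, L))
```

For N = 4 the search returns L = 4 with 12.5 multiplications per output sample, which equals 2N + 2√N + 0.5.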

This bound on the sampling rate can be removed if each second order filter
is realized in parallel with a systolic approach. However, compared to the
systolic implementation in section 3.4, the block cascade form has longer
latency because of its pipelining nature. Both structures require more
computation than the systolic structure described in section 3.4, because a D
matrix exists in each second order subfilter, and each one has the same size as
the D matrix in section 3.4. The total amount of computation, on the other hand,
is the same for the other three matrices in each structure.

4.2.  Susceptibility  to  Transmission  Delay 

In this section, the effect of the message transmission delay on the
overall  sampling  rate  is  investigated.  It  is  well  known[34]  that  the  feedforward 
operation  can  be  put  into  pipelining  stages  and  any  number  of  delays  can  be 
inserted  without  affecting  the  overall  sampling  rate.  Hence,  the  delay  on  the 
feedforward  path  will  be  neglected.  However,  any  communication  delay  does 
increase  the  filter  latency.  The  effect  of  delay  on  any  recursive  computation  will 
be  carefully  investigated  in  this  section.  If  any  delay  exists  in  the  data  transfer 
among  the  PE’s  computing  the  feedback  operation,  the  speed  performance  will 
be  degraded. 

4.2.1.  Cascade  and  Parallel  Forms 

The cascade form is a typical pipelining structure; hence any transmission
delay on the links connecting the adjacent stages does not affect the overall
sampling rate at all. For the parallel form, two sets of communication links are
required  for  the  data  transmission.  One  set  distributes  the  input  samples  to 
each  second  order  section.  The  second  set  of  links  transmits  the  results  of  each 
section  to  the  final  summing  node.  If  this  node  is  cascaded  with  all  the  other 
filter  sections,  the  transmission  delay  does  not  have  any  effect.  On  the  other 
hand,  the  delay  slows  down  the  filter  if  the  summing  node  is  lumped  into  one  of 
the  subfilters.  Furthermore,  the  filter  section  that  contains  the  summing  node 
has  the  longest  execution  time,  and  hence  dominates  the  throughput.  If  the 
connection  of  the  summing  node  is  cascaded,  transmission  delay  on  any  link 
does  not  affect  the  throughput  rate.  Since  the  whole  structure  can  be  viewed  as 
a  connection  of  three  blocks,  namely,  data  source,  filter  sections  and  the  final 
summing  node,  the  transmission  time  between  any  two  cascading  stages  does 
not  affect  the  sampling  rate. 
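The claim that link delays between cascaded stages add latency but leave the sampling rate unchanged can be illustrated with a simple timing model in Python. The model, the function name and the stage times are assumptions made for this sketch: each stage handles one sample at a time, and all input samples are available at time 0.

```python
def pipeline_finish_times(stage_times, link_delays, num_samples):
    """Finish time of every sample at the last stage of a cascade.

    stage_times[i] is the processing time of stage i; link_delays[i] is the
    transmission delay on the link from stage i to stage i + 1.
    """
    arrivals = [0] * num_samples          # all samples ready at time 0
    done = list(arrivals)
    for i, proc_time in enumerate(stage_times):
        done = []
        free_at = 0                       # time at which this stage is free
        for arrival in arrivals:
            start = max(arrival, free_at)
            free_at = start + proc_time
            done.append(free_at)
        delay = link_delays[i] if i < len(link_delays) else 0
        arrivals = [t + delay for t in done]
    return done


# Hypothetical three-stage cascade; the slowest stage takes 5 time units.
print(pipeline_finish_times([3, 5, 2], [0, 0], 4))
print(pipeline_finish_times([3, 5, 2], [4, 7], 4))
```

In both runs successive outputs are 5 time units apart, set by the slowest stage; the link delays only shift the first output from time 10 to time 21.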

4.2.2.  Systolic  Arrays 

For the systolic arrays, any transmission delay would reduce the speed
performance. Although the systolic array is a highly pipelined structure, the
data are pipelined in two directions for the IIR filter whose structure is
shown in Figure 2-4 in section 2.3. The behavior of this two way pipelining is
quite different from the pipelining in the cascade form. Each cell obtains data
from its two neighbors, and its outputs are sent to these neighbors for the
operation in the next cycle. Hence, any transmission delay would increase the
time interval between fetching two adjacent samples. This increased time
interval implies a lower sampling rate. However, because the communication
links are localized and the number of links does not increase with the filter
order, we can assign a dedicated high bandwidth link for each message
transmission. Therefore, the transmission delay could be negligible compared to
the execution time of each cell. Thus, the overall sampling rate should remain
essentially unchanged even considering the transmission delay.

4.2.3.  SSIMD  Mode 

The transmission delay of messages will seriously degrade the performance
of Barnwell's structure. Unfortunately, due to its extremely complex
interconnection (see Figure 2-6b), this transmission delay seems inevitable.
Furthermore, as described in section 2.4, the highest throughput depends on the
availability of all its delayed recursive inputs, which are usually generated in
some other PE's. Therefore, a delay on any link would delay the available time
of the recursive inputs of some other PE's. We will give a detailed discussion
of the effect of this delay time on the sampling rate.

Barnwell  has  pointed  out[9]  that  the  largest  potential  problem  in  SSIMD 
solutions  concerns  the  inter-PE  communication.  However,  he  also  claimed  that 
the  fundamental  periodicity  of  the  SSIMD  solution  makes  the  communications 
requirements very uniform, which avoids many potential time conflicts. He also
says that the nature of the communications environment can be systematically
controlled by a long delay chain. The first claim is true; however, the second
is true only if shared memory is allowed. Owing to the finite speed of memory
access,  the  memory  can  only  be  shared  among  a  limited  number  of  PE’s.  For 
very  high  rate  filters,  it  is  more  feasible  to  transfer  data  among  PE’s  rather  than 
to  store  them  in  a  common  memory  space.  Then,  the  extremely  complex  data 
flow  pattern  will  surely  cause  a  serious  problem  in  achieving  high  sampling  rate 
in  a  multiprocessing  system. 

The delay effect can be analyzed by looking at Figure 4-1. If there is a delay
time d for transmitting the recursive output of PE 0 to PE l, the earliest time
that PE l can start computing is also delayed by the same amount. Hence, the
skew time between any two adjacent PE's becomes (R - I(l) + d)/l. The maximum
number of PE's that can be used without introducing any idle time becomes

    $M^* = \mathrm{INT}[M(l^*)], \quad M(l) = \frac{l\,T}{R - I(l) + d_l}$    (4-4)

where $d_l$ is the delay time for the l-th delayed recursive input on PE l.
Since the sampling period is inversely proportional to this number, the effect
of this delay on the throughput rate is easily seen. For high order filters,
since more communication links are required for each PE, the throughput rate
will be greatly reduced if there is a large delay on any communication link to
any PE.

One way to ease this problem is to apply this structure to each subfilter of
the cascade or parallel form. Within each subfilter, only two incoming links for
the  recursive  inputs  are  required  and  the  delay  time  between  cascading  stages 
does  not  affect  the  overall  throughput.  Hence,  it  can  achieve  the  same  sampling 
rate  while  suffering  much  less  from  the  communication  delay.  However,  it  is 
impossible  to  totally  eliminate  the  delay  effect,  unless  only  one  PE  is  used. 

A side effect of this complex interconnection is that, since the number of
pins on each chip is limited, some dedicated switching PE's might be needed to
handle this complex data transfer. Therefore, the actual number of PE's may be
correspondingly larger than expected. These switching PE's are not required in
simpler structures, such as the other three SISO filter structures. Thus,
this extra usage of PE's should also be counted when comparing its performance
with other structures.

4.2.4.  Block  I/O  Filters 

As mentioned earlier in this section, only the transmission delay in the
feedback operations will affect the overall throughput rate. Hence, for the block
I/O filter, only the feedback product matrix $B_0^{-1}B_1$ has to be considered. The
realization of the other two product matrices is not essential as far as the
throughput rate is concerned.


The delay effect on the sampling rate for the block I/O filters depends on
how the operations of the feedback matrix are scheduled. If any communication
links exist for connecting the PE's computing the feedback submatrix, transmission
delay on these links will degrade the speed performance. This delay time
may introduce some idle time in the PE's, and hence increase the actual execution
time of the affected PE's. The block size is then determined by the PE
which has the longest execution time. Fortunately, the size of the feedback
matrix does not change with the input block length. Hence, the communication
environment will not become more complicated, even if the block size has to
increase to compensate for this transmission delay.

If multiprocessing is employed to compute the feedback submatrix, a
broadcasting type of communication is required. Hence, transmission delay
is likely to occur in such a complex communication environment. Assigning
each row to one PE, as mentioned in section 4.1.4, is not a good schedule as far
as the delay effect is concerned. One way to avoid this delay effect, of course, is
to assign the N×N lower right submatrix of $B_0^{-1}B_1$ to one PE. Then, the feedback
part is contained within a single PE, and hence the sampling rate is not
sensitive to any transmission delay. However, because of the longer execution
time, the filter latency is much longer (especially for high order filters).

4.2.5.  Block  State  Implementation 

In the case of a block diagonal state matrix, if each submatrix on the diagonal
is assigned to one PE, the transmission delay has no effect on the throughput
rate, for no communication is required for the feedback matrix. Therefore, the
throughput rate stays the same even if there is a large delay on any data
transfer path. For the systolic approach, since data flows in a regular and simple
manner, the data transmission can be very efficient. As distinct from the
systolic array filter described by Kung, only one link is required between any two
adjacent cells and the data transmission is unidirectional. Hence, more pins can
be used for one message transmission. This makes the data transfer even more
efficient.

For Zeman's structures, the communication environment is similar to the
parallel and cascade forms and hence the throughput is insensitive to the
communication delay.

4.3.  Latency  Consideration 

As mentioned in section 3.2, latency is sometimes an important factor that
limits the applicability of digital filters to real time signal processing. Although
most signal processing or communication systems can tolerate latency, in some
applications, such as real time speech or where a filter is in a feedback loop,
latency is important and it is usually desired that this delay be as short as possible.
Most importantly, latency is very susceptible to the message transmission delay.
Since latency includes both the execution time of the PE's and the transmission
delay time, the latency of a structure whose throughput is susceptible to the
transmission delay is also sensitive to this delay.

4.3.1. SISO Filters

For the cascade form, the filter latency is the sum of the execution times of
all the cascading stages if there is no transmission delay. Thus, this latency is
similar to that of a direct form realization. However, since the constant term is
distributed into all the stages, the total execution time is usually larger for the
cascade form than the direct form. Besides this, the transmission delay on each
link adds to the overall latency.

The parallel form has much smaller latency than the cascade form. The
latency equals the execution time of a second order filter plus the processing
time of the final summing node as well as the transmission delay on the links
from the filter sections to this summing node. In the worst case, the highest
transmission delay among all the links connecting the filter sections to the
summing node should be added to the overall latency. Thus, the delay effect on
the latency is not so severe as in the cascade form.
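
The difference between the two latency models can be illustrated with a minimal sketch. The function names and the numbers (execution times and link delays in clock cycles) are hypothetical and chosen only for illustration; the cascade path is assumed to include every stage and every inter-stage link, while the parallel path includes one section, the slowest link and the summing node.

```python
def cascade_latency(stage_times, link_delays):
    """Cascade form: every stage and every inter-stage link lies on the path."""
    return sum(stage_times) + sum(link_delays)


def parallel_latency(section_time, summing_time, link_delays):
    """Parallel form: one section, the slowest link, then the summing node."""
    return section_time + max(link_delays) + summing_time


# Hypothetical example: four second order sections, times in clock cycles.
cascade = cascade_latency([5, 5, 5, 5], [1, 1, 1])
parallel = parallel_latency(5, 1, [1, 2, 1, 1])
print("cascade latency:", cascade)
print("parallel latency:", parallel)
```

In this sketch a slow link costs the cascade form once per stage boundary, whereas the parallel form pays only for the single worst link.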

As  for  the  systolic  array  realization,  since  each  output  sample  goes  through 
all  the  cells  before  it  can  be  sent  out,  the  overall  latency  equals  that  of  the 
direct  form.  As  discussed  in  section  4.2.2,  the  transmission  delay  should  be 
negligible  for  this  implementation. 

The latency of Barnwell's structure is also equivalent to that of a direct
form realization, if the number of PE's does not exceed $M_a + 1$.

4.3.2.  Block  Filters 

For block state filters, latency is extremely important, since it is the only
limiting factor. For block filters, the overall latency is also governed by the
structure. Since the two dimensional systolic array structure for the block
state filter is the most attractive structure, its latency behavior deserves
further investigation. A problem that was left open in section 3.4 is the way of
grouping four cells into one PE. We give here a detailed discussion about how
this cell grouping can affect the latency.

From Figure 3-5, it is clear that the input data can go through either
matrices B and A or matrix D to get to the entries of matrix C. Since the state
variables can be precomputed in the previous time interval, the overall latency
is the sum of the execution times of all the cells on the longest path from the
input to the output through matrix D and matrix C. In other words, the maximum
number of cells on those paths determines the latency and should be
minimized. From the above argument, the best way to group the cells
is to combine four cells on the same column in either matrix D or matrix C into
one. Since all the columns are operated in parallel, this combination would
minimize the number of cells.

For perfect transmission (no transmission delay), the filter latency is simply
the overall execution time on the path which requires the longest time.
Therefore, for the linear systolic array and SSIMD approaches, the latency
equals that of a direct form implementation, which has 2N multiplications,
where N is the order of the filter. For the block state filters, the number of
multiplications on the longest path, which passes through matrices D and C, is L+N,
if four cells on the same column are grouped together. Since L is usually much
larger than N for high sampling rate signals, this latency is longer than that of
the SISO filters.

4.4.  Number  of  Operations 

IIR filters are known to require a lower computational rate than their
equivalent FIR filters. Hence, unless linear phase is essential or there is some
restriction precluding recursive operation, such as adaptation, IIR filters are
usually used to save chip area. However, as the block size increases for block
filters, the average number of multiplications required to generate one output
sample also increases. On the other hand, it is straightforward to increase the
throughput rate of an FIR filter while keeping the number of multiplications per
output sample unchanged. Hence, block IIR filters lose their advantage over FIR
filters at high sampling rates. This implies that FIR filters are better choices for
processing signals sampled at extremely high rates.

Although the average number of operations for generating one output sample
may not be important for a multiprocessing system, it does reflect the die
area or the number of chips required. Thus, among the structures that can
achieve the same desired throughput, it is advantageous to choose the structure
with the minimum number of multiplications. However, some other factors, such as
the complexity of implementation and interprocessor communications,
should also be considered before deciding which structure to choose. In this
section, we will concentrate on the comparison of computational rate between
FIR and IIR block state filters.


Although the FIR filter structure as shown in Figure 1-2 with direct form
implementation in each section can achieve any desired rate as a block state
filter, the input data distribution pattern would be very complex. It is not fair to
compare the number of operations between this structure and the block state
structure as shown in Figure 3-7 in section 3.4. The communication overhead of
Figure 1-2 might severely degrade the filter performance. Although the sampling
rate may not be affected, the extra hardware for handling the data switching
seems unavoidable. Hence, the number of operations itself does not fully
reflect the hardware complexity. On the other hand, the block FIR filter
implementation as shown in Figure 3-6b has a very similar structure to the block
state filter. The computational rates of these two structures will be compared.

For an Nth-order block state IIR filter with block diagonal state matrix, the
total number of multiplications to generate an output vector is given as

$$2N + 2LN + \frac{L(L+1)}{2}$$

This number is obtained by assuming that all the poles are complex and distinct
and by considering the fact that the matrix D is lower triangular. Therefore, the
average number of multiplications to generate one output sample is

$$2N + \frac{2N}{L} + \frac{L+1}{2} \qquad (4-3)$$

Suppose an FIR filter with M multiplications per output sample has a similar
frequency response to the Nth-order IIR filter. We should use an FIR filter instead
of an IIR filter if the block size L satisfies the following relationship:

$$2N + \frac{2N}{L} + \frac{L+1}{2} > M \qquad (4-4)$$

The value M depends on the actual implementation of the FIR filter and how we
compare IIR and FIR filters. If the fact that computation might be saved due to
the symmetry of FIR filter coefficients is neglected, this value is the number of
taps.

The comparison between FIR and IIR filters is not simple. Rabiner et al. [35]
made a comparison by looking at the passband and stopband ripples as well as
the transition band. In Fig. 7b of their paper, a sixth order elliptic filter is
approximately equivalent to a 41-tap FIR filter. Plugging these numbers into
equation (4-4), an FIR filter should be used if L exceeds 56 (see Figure 4-3). This
number depends heavily on the criteria for comparison and the implementation
of the FIR filter. So, this upper bound of L should be computed for each filter
design.
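
The crossover block size quoted above can be checked numerically. The following is a minimal sketch, assuming the multiplication count per output vector is 2N + 2LN + L(L+1)/2 (2N for the block diagonal state matrix, LN each for the input and output matrices, and L(L+1)/2 for the lower triangular matrix D); the function names are hypothetical.

```python
def iir_mults_per_sample(order, block):
    """Average multiplications per output sample of the block state IIR filter."""
    total = 2 * order + 2 * block * order + block * (block + 1) / 2
    return total / block


def largest_block_favoring_iir(order, fir_mults, limit=10_000):
    """Largest block size L (up to limit) for which the IIR count does not exceed the FIR count."""
    favorable = [
        block
        for block in range(1, limit + 1)
        if iir_mults_per_sample(order, block) <= fir_mults
    ]
    return max(favorable) if favorable else None


# Sixth order elliptic filter against a 41-tap FIR filter, as in the text.
crossover = largest_block_favoring_iir(order=6, fir_mults=41)
print("IIR block state filter is cheaper up to L =", crossover)
```

Under these assumptions the search reproduces the bound stated in the text: the FIR filter needs fewer multiplications only once L exceeds 56.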

If a typical existing signal processor is chosen as a PE to realize the IIR
block state filter with a two dimensional systolic array structure, the sampling
rate with a block size at the cross point in Figure 4-3 can be estimated to be
high enough for most applications. Suppose the internal clock rate of each PE is
5 MHz and the I/O overhead can be ignored. The block state filter structure can
generate a block of output samples (56 samples in this case) in the time of
performing four multiplications and four additions. If the PE's can perform
accumulation (one multiplication and one addition) in one clock cycle, the vector
throughput rate will be 1.25 MHz. The scalar sampling rate can then be easily
calculated to be 1.25 × 56 = 70 MHz. In most practical applications, this sampling
rate is higher than necessary. In other words, the IIR block state filters should
be adopted for most applications.
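
The arithmetic behind this estimate can be written out as a short sketch. It assumes a block size of 56 (the crossover value quoted for Figure 4-3, which is also the value implied by the 70 MHz result), four multiply-accumulate cycles per block, and no I/O overhead; the function name is hypothetical.

```python
def scalar_sampling_rate_hz(clock_hz, cycles_per_block, block_size):
    """Scalar sampling rate when one block of outputs is produced every cycles_per_block clocks."""
    vector_rate_hz = clock_hz / cycles_per_block  # blocks per second
    return vector_rate_hz * block_size            # samples per second


rate_hz = scalar_sampling_rate_hz(clock_hz=5e6, cycles_per_block=4, block_size=56)
print("scalar sampling rate: %.1f MHz" % (rate_hz / 1e6))
```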

An obvious fact that can be inferred from Figure 4-3 is that the average
number of multiplications decreases initially as the block size increases. The
number of multiplications per output sample reaches a minimum point for a
certain block size, which is greater than 1. This is similar to Zeman's conclusion
on the optimal block length. Thus, even considering serial processing only in
state space design, buffering input samples into blocks of a suitable size can
achieve a higher throughput rate than the corresponding SISO filter.

4.5.  Hardware  Usage 

SISO filters can easily achieve a highly efficient usage of hardware. This
high efficiency is inherent in the algorithms. All four SISO structures
described in chapter 2 have exactly the same operation for almost all the PE's.
Barnwell's structure is perfect in this respect, because it is an SIMD mode structure.
All the PE's execute the same code but on different input samples. The
hardware usage for block filters, on the other hand, is not inherent in the filter
equations. In the tree structure of inner products as shown in Figure 3-1, for
instance, it is not easy to find a perfect schedule such that all the PE's are
evenly loaded.

The systolic array realization of block state filters can not only achieve high
throughput with localized communication but can also be verified as near optimum.
It is optimal in the sense that it requires the smallest number of PE's to achieve
the same throughput compared to any other structure. Intuitively, since each cell
in Figure 3-7 contains exactly the same number of operations, no idle time is
required for any PE to wait for the data from any other PE. However, since it
is unlikely that the total number of entries in each of the three big rectangles in
Figure 3-7 is a multiple of four, some cells must have fewer operations.
This is why this structure is called near optimum.

Assuming perfect division, the optimality can be easily proven by Renfors'
theorem [36]. Before showing this theorem, some important terminology is
summarized as follows.


(2) The arithmetic delay is simply the time to perform an arithmetic operation.
The time for an addition is associated with the link that comes out of
the adder.

For the i-th loop, define

$$D_i = \sum_{j \in C_i} d_j$$

as the total time for all the arithmetic operations in this loop, where the sum
runs over the arithmetic delays $d_j$ in loop $C_i$. In order that this
loop be computable, the following relation must be satisfied:

$$T \ge \frac{D_i}{n_i}$$

where T is the unit sample delay and $n_i$ is the number of unit delays in this loop.
The above equation must hold for all the loops in the network. Let's choose

$$T_0 = \max_i \left\{ \frac{D_i}{n_i} \right\} \qquad (4-7)$$

where the maximum is over all the directed loops. A loop in which this maximum
is reached is called a critical loop. Now the theorem can be stated as follows.

Theorem  4-1 

A digital filter network with unit delay T is computable if and only if $T \ge T_0$,
where $T_0$ is defined by (4-7).
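
The bound in Theorem 4-1 can be evaluated mechanically once the loops of a network are listed. The sketch below assumes each directed loop is described by the arithmetic delays on it and its number of unit delays; the data layout, function names and example numbers are hypothetical and not taken from the cited paper.

```python
def minimum_sample_period(loops):
    """Return T0, the largest ratio (total arithmetic delay) / (unit delays) over all loops.

    loops: iterable of (arithmetic_delays, n_unit_delays) pairs.
    """
    return max(sum(delays) / n_delays for delays, n_delays in loops)


def is_computable(sample_period, loops):
    """A network is computable when the sample period T is at least T0."""
    return sample_period >= minimum_sample_period(loops)


# Hypothetical network with two directed loops (delays in arbitrary time units).
loops = [
    ([2.0, 1.0], 1),       # one multiply and one add around a single unit delay
    ([2.0, 1.0, 1.0], 2),  # a longer loop containing two unit delays
]
t0 = minimum_sample_period(loops)
print("T0 =", t0, "computable at T=2.5:", is_computable(2.5, loops))
```

Here the first loop is the critical loop, since it has the largest delay per unit delay.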

For the block state filters, since any sampling rate is achievable, the maximum
achievable sampling rate described in the above theorem is meaningless
in this case. Therefore, the optimality issue should be addressed in a different
manner. In the following discussion, we define the optimal network as a structure
which can achieve a given sampling rate with the minimum number of PE's.
Assume that the filter has only simple order poles and the state matrix is in a
block diagonal form.

Since there is only one delay element in the block state filter, the highest
sampling rate can be decided by looking at the feedback operation by Renfors'
theorem. Since this operation also determines the block sizes, the comparison
between different numbers of PE's for different block sizes is equivalent to the
comparison between their average number of operations used to achieve some
fixed sampling rate. From Figure 4-3, it is clear that an optimal block size exists
which is always larger than 1. Actually, from section 4.1.5 it is clear that this
number is $L_{opt} = 2\sqrt{N}$.
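
The value of the optimal block size can be checked with a short derivation. This is a sketch that assumes the per-sample multiplication count of section 4.4, namely 2N + 2N/L + (L+1)/2, and treats L as a continuous variable; the symbol f(L) is introduced here only for convenience.

```latex
\begin{align*}
f(L) &= 2N + \frac{2N}{L} + \frac{L+1}{2} \\
\frac{df}{dL} &= -\frac{2N}{L^{2}} + \frac{1}{2} = 0
  \quad\Longrightarrow\quad L_{opt} = 2\sqrt{N} \\
f(L_{opt}) &= 2N + \sqrt{N} + \sqrt{N} + \frac{1}{2} = 2N + 2\sqrt{N} + \frac{1}{2}
\end{align*}
```

Since the second derivative $4N/L^{3}$ is positive for L > 0, this stationary point is a minimum. For a sixth order filter it gives a value near 4.9, so the best integer block size under this count is 5.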

From section 3.2, it is clear that grouping one submatrix into one PE can
give us a simple interconnection with the smallest block size. Suppose this
block size is $L_0$. Since grouping multiple submatrices into one PE also results in
simple interconnection among PE's, the best choice of the block size should be the
multiple of $L_0$ which is closest to $L_{opt}$. For most applications, $L_{opt}$ is small
enough and hence $L_0$ is usually the best block size to choose. However, if a
larger block size is necessary for the optimization, the accompanying longer
latency should also be considered when choosing the block size.

4.6.  Conclusions 

The comparison of performance among all the filter structures mentioned
in chapters 2 and 3 has been discussed in this chapter. It appears that with
conventional single-input single-output systems, an upper bound on the sampling
rate always exists. Among the SISO filters, Barnwell's structure can achieve the
highest sampling rate if the same hardware is used. On the other hand, the
block processing of the input samples can remove this limitation. Higher
throughput rate can be achieved with larger input block size. Hence, when high
throughput rate is desired, block filters rather than SISO filters should be used.

When circuit size gets large, the data transmission within a chip or among
chips might take more time and may affect the speed performance. Structures
such as systolic arrays require only local and regular interconnections and
hence both the data transmission paths and the transmission time can be
short. Furthermore, this local interconnection is also advantageous for VLSI
design, since the interconnection wires are expensive in VLSI circuits. Therefore,
a block state filter with systolic arrays is the most efficient structure as
measured by the throughput and the transmission cost. Barnwell's structure
suffers more from the data transmission delay due to its complex interconnection
pattern. Hence, although it can achieve higher sampling rate than the
other SISO filters, the transmission delay might make it inferior.

The comparison between the block state filters and their equivalent FIR
filters is also presented. It is clear from section 3.1 that the two dimensional
systolic array structure can also be applied to the implementation of FIR filters.
Hence, the FIR filter can be realized with only local interconnection among
processing elements. The throughput rate of the FIR filter can be increased by
increasing the dimension of this array in a similar fashion as the block filters.
However, the average computation of an FIR filter does not increase with the
block size, whereas for block filters, this rate does increase with the block size.
This rate affects the requirement on the number of PE's and hence should be
minimized. The comparison between the rates of FIR and block filters shows
that for very high sampling rate filters, the FIR filter is a better choice in terms
of the hardware size.


CHAPTER  5 


EFFECTS  OF  FINITE  REGISTER  LENGTH 

From the discussion in the previous chapters, it is clear that block state filters
outperform all the other parallel structures in overall throughput rate. However,
another critical issue, which has not been discussed yet but would affect the
complexity of the actual filter implementation, is the effect of finite register
length. A structure which is sensitive to this effect requires more bits for
representing each sample, and hence would slow down the arithmetic operations.
Although these lower speed arithmetic operations do not prevent us from
achieving high throughput rate if the block state structure is used, they do
increase the required value of L and hence the amount of hardware. Therefore,
the finite word length behavior of the block state filters will be studied in detail
in this chapter.

5.1.  Introduction 

When a digital filter is implemented either with dedicated digital hardware
or as a computer program, its coefficients and the input and output samples
have to be quantized. This finite precision of filter coefficients and sample
values will cause some error in the output signal. The sources of this quantization
error are summarized as follows [37]:

(1)  The  filter  coefficients  must  be  quantized  to  some  finite  number  of  bits. 

(2)  The  input  samples  to  the  filter  must  also  be  quantized  to  a  finite  number  of 
bits. 

(3) The products of the multiplications within the filter must usually be
rounded or truncated to a smaller number of bits prior to subsequent
operations, since each multiplication doubles the number of bits for full
precision.

(4) When floating-point arithmetic is used, rounding or truncation must usually
be performed before or after additions as well.

The complex arithmetic units also imply lower speed for each operation. Therefore,
the finite register length effect is an important issue to characterize for
each filter structure.


The first error source will be briefly treated in the next section and the
third and the fourth error sources, which are usually called roundoff noise, will
be discussed afterwards. As for the second error source, since it does not concern
the filter structure and there is no way to minimize this error by optimizing
the filter structure, it will not be discussed here. The discussion of this error,
which is usually referred to as quantization noise, can be found in most
communication books. The third and fourth errors are different from the second
error source in two respects:

(i) the data to be quantized is already digital in form, and

(ii) the rounding or truncation of the data takes place at various positions
within the filter, not just at its input.

Because of (ii), the roundoff error is potentially much larger than the input
quantization error, and is one of the principal factors which determine the
complexity of the filter implementation.

The exact analysis of the roundoff error behavior is extremely difficult since
the generation of this error is a highly nonlinear process. It is helpful to perform
an approximate analysis by representing the effect of rounding in terms of
an additive error signal which is referred to as roundoff noise. The filter can
then be modeled as a linear filter associated with noise sources at various
places. The output noise statistics can then be analyzed by standard random
process techniques.


For simplicity, only fixed point arithmetic, where rounding occurs only in
multiplication but not in addition, will be considered. For two's complement
arithmetic, it is usually assumed that the roundoff noise of a single rounding can
be modeled as white noise with a variance given by [1, Chap. 9]

$$\sigma_0^2 = \frac{\Delta^2}{12}$$

where $\Delta$ is the quantization step size. With fixed point arithmetic, this step size
is uniform over the whole dynamic range and hence a constant power noise
source is associated with each rounding error. For an n-bit register length, this
step size is $2^{-n}$, if the dynamic range is unity. When multiplying two n-bit
numbers, a 2n-bit register is required to keep the accuracy of the product. This
outcome must be rounded to n bits if it is fed back as an input for the calculation
in the next cycle. If this outcome is not fed back in the next cycle, a wider
register can be used to keep its full precision. This is why recursive filters are
potentially noisier than nonrecursive filters. Thus, an error results with a mean
square value $2^{-2n}/12$. The equivalent filter considering the roundoff noise of the
original direct form second order filter is shown in Figure 5-1. In this graph, the
rounding is assumed to occur after each multiplication and before the summation.
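
The white-noise model for a single rounding can be checked empirically. The following is a minimal sketch, assuming a step size of 2^-n and values drawn uniformly from [-1, 1]; it compares the measured mean square rounding error with the standard uniform-error value (step squared over 12). The function name, sample count and seed are arbitrary illustrative choices.

```python
import random


def rounding_noise_power(bits, samples=200_000, seed=0):
    """Mean square error of rounding uniformly distributed values to a step of 2**-bits."""
    rng = random.Random(seed)
    step = 2.0 ** -bits
    total = 0.0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        error = round(x / step) * step - x  # quantized value minus exact value
        total += error * error
    return total / samples


bits = 8
measured = rounding_noise_power(bits)
predicted = (2.0 ** -bits) ** 2 / 12  # equals 2**(-2*bits) / 12
print("measured:", measured, "predicted:", predicted)
```

The agreement is statistical, so the two numbers match only to within the sampling error of the experiment.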

The effect of the first error source will be discussed in the next section.
Emphasis will be put on the stability problem, which results from the pole
movement due to the finite precision of the filter coefficients. Block state filters will
be shown to be a good structure in this respect. The roundoff noise behavior will be
discussed in the following sections. A detailed derivation for filters in the state
space design will be presented to verify that block state filters have less
roundoff noise as the block size increases.


5.2.  Stability  Consideration 

The first error source usually results in some deviation in the actual
frequency response from the desired one. Hence, the implemented filter has a
frequency response which is not exactly the same as with infinite precision
coefficients. An even more serious problem resulting from the finite precision of
the filter coefficients is the stability problem. Quantization of the filter
coefficients to finite length can cause a shift of the pole locations. If a filter has
a pole close to the unit circle with infinite precision, this quantization may push
this pole out of the unit circle and hence make the filter unstable.

For the block state filter realization, the filter stability becomes less
sensitive to coefficient quantization as the block length L increases. For a minimal
realization in the state space domain, the eigenvalues of the state matrix are
equivalent to the poles of the filter. According to equation (2-32), the
eigenvalues of the state matrix of a block state equation are the corresponding
eigenvalues in the scalar case raised to the power L, where L is the block size. If the
poles are originally located within the unit circle, increasing the block size will
move all the poles toward the origin, making instability due to coefficient
quantization less likely.
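
The movement of the poles with block size can be illustrated with a minimal sketch. It assumes only the stated relation that each block eigenvalue is the scalar eigenvalue raised to the power L; the pole value and function name are hypothetical.

```python
def block_eigenvalue(scalar_eigenvalue, block_size):
    """Eigenvalue of the block state matrix for a given scalar eigenvalue and block size L."""
    return scalar_eigenvalue ** block_size


# Hypothetical complex pole close to the unit circle (magnitude about 0.99).
pole = complex(0.7, 0.7)
for block_size in (1, 2, 8, 32):
    magnitude = abs(block_eigenvalue(pole, block_size))
    print("L = %2d  |eigenvalue| = %.4f  margin to unit circle = %.4f"
          % (block_size, magnitude, 1.0 - magnitude))
```

The growing margin to the unit circle is what makes a given coefficient perturbation less likely to cause instability.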

This pole dependence on L also has a great impact on the roundoff noise
behavior, as will be discussed in detail later in this chapter. However, intuitively,
a smaller eigenvalue implies a shorter time constant. Hence, the noise
sources generated on the internal nodes decay faster in time in the block case
than in the scalar case. Suppose a filter has a single eigenvalue $\lambda$ and an initial
state $a_0$ at time 0; then this initial state will contribute to the output by an
amount $a_0 \lambda^n$ at time n. This can be easily seen in Figure 5-2, which is a signal
flow graph of a single pole filter. For a fixed initial state, as time goes on, the
contribution of this state to the output decays faster for a smaller eigenvalue
than for a larger eigenvalue. If at time m ($m \le n$), a noise is generated at the
summing node with power $\sigma_0^2$, the overall output noise power at time n is the
sum of all the output noise powers contributed from the noise sources generated
before time n. This overall noise power at time n can be represented by

$$\sigma_n^2 = \sum_{m=0}^{n} \sigma_0^2 \lambda^{2(n-m)}$$

Figure 5-2 Signal Flow Graph of a Single Pole Filter

Obviously, for smaller eigenvalues, or equivalently smaller pole magnitudes, the
overall output noise power at time n is less than for larger eigenvalues. In
conventional filter design, the filter response will be totally different if the
pole position is shifted. On the other hand, the block state filter can achieve
the same scalar transfer function while reducing the magnitudes of all the
poles; hence, the roundoff noise is reduced as the block length increases.
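
As a small numerical illustration of this argument, the sketch below (Python) accumulates the noise power of a single pole filter. It assumes that a noise sample of power sigma0_sq injected at time m reaches the output at time n scaled by lam**(2*(n-m)); the function names and the pole value 0.9 are illustrative only.

```python
def output_noise_power(lam, sigma0_sq, n):
    """Sum the contributions sigma0_sq * lam**(2*(n-m)) for m = 0..n."""
    return sum(sigma0_sq * lam ** (2 * (n - m)) for m in range(n + 1))

def steady_state_noise_power(lam, sigma0_sq):
    """Limit of the sum as n grows: a geometric series in lam**2."""
    return sigma0_sq / (1.0 - lam * lam)

# The eigenvalue of the block filter is the scalar eigenvalue raised to L.
lam = 0.9
for L in (1, 2, 4):
    print(L, output_noise_power(lam ** L, 1.0, 500))
```

The accumulated power falls as the eigenvalue is raised to a higher power L, in line with the paragraph above.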

In the next section, the general approach for the roundoff noise analysis will
be briefly outlined. This analysis, which is done in the frequency domain, is a
very common technique. In the following sections, the roundoff noise analysis in
the time domain will be discussed in detail by using the state space
representations. The average noise power can be represented by the filter
equation directly, and no transfer function in the frequency domain is involved.

5.3.  Frequency  Domain  Analysis 

The conventional roundoff noise analysis is done in the frequency domain.
Assuming white, uncorrelated noise sources at each internal summing node, the
output noise power can be obtained by summing up the noise power contributed
from all the internal summing nodes. The contribution of each node is in turn
obtainable by multiplying the noise power at this node by the frequency
response from this node to the output. Jackson [5] introduced a state variable
approach to characterize the noise behavior.

A digital filter network may be represented as shown in Figure 5-3. The
transfer function from the input to the i-th branch node is denoted as $F_i(z)$,
and $G_i(z)$ denotes the transfer function from this summing node to the output.
$H(z)$ is the transfer function from the input to the output. Associated with
each summing node is the roundoff error generated at time n, which is labeled
$e_i(n)$. This error has a mean square value $k_i\sigma_0^2$, where $k_i$ is the
number of products to be summed up at this summing node. If rounding is
performed after summing up all the products, this number would be unity. The
noise power generated at the output summing node is a constant and usually is
negligible compared to the noise generated from the internal summing nodes.

Suppose this network is properly scaled; that is, the power gain from the
input to each internal summing node satisfies some scaling rule. Then, the
contribution of the i-th noise source to the overall noise power can be written
as

$$N_i(\omega) = k_i\,\sigma_0^2\,\left|G_i(e^{j\omega T})\right|^2$$

Since the noise sources are uncorrelated from source to source and from time
to time, the output noise has a power spectrum

$$N(\omega) = \sigma_0^2 \sum_i k_i \left|G_i(e^{j\omega T})\right|^2$$

The variance, or total average power, of the output noise can then be obtained
by the following equation

$$\sigma^2 = \frac{1}{\omega_s}\int_0^{\omega_s} N(\omega)\,d\omega$$

where $\omega_s$ is the radian sampling frequency given by $\omega_s = 2\pi/T$.
From the above derivation, it is clear that in order to obtain the overall noise
power the calculation of the transfer functions from each internal summing node
to the output is required. This tedious calculation makes the noise analysis and
the comparison between various structures extremely difficult.

5.4. Time Domain Analysis

As mentioned before, the state space representation of a digital filter is
very desirable for roundoff noise analysis. The analysis, as will be seen later,
is done in the time domain instead of in the frequency domain. The output noise
power can be represented, at any time n, by the four matrices in the state
equations. Thus, there is no need to calculate the transfer functions from all
the internal state nodes to the output as in the frequency domain analysis. This
is possible because the state equations completely specify the filter structure,
and hence the roundoff noise can be characterized by the coefficients. In order
to obtain the steady state output noise power, simply look at the limit of this
noise power by letting the time index approach infinity.

Barnes [33] proved in his paper that MIMO filters have much less roundoff
noise than their corresponding SISO filters in the state space design. Actually,
the noise power generated from the recursive operation for an SISO filter is
equivalent to the sum of the noise power of all the samples in a block for the
corresponding MIMO filter. Hence, the average noise power over the outputs of
one block decreases as the block size L increases. However, all his conclusions
are based on the assumption that the rounding occurs, at each state node, after
summing up all the products. Thus, each summing node has a noise source
whose noise power equals that in equation (5-1). A wider adder is then needed at
each summing node to achieve this performance. Zeman [23] showed some
results which verified Barnes' conclusions.

Even if each product is rounded before being sent to be summed up, the average
noise power generated from the recursive operation is still less than in the
corresponding SISO filter. Hence, even with a narrow adder at each summing
node, the roundoff noise can still be reduced compared to an SISO filter design.
However, the noise power generated from the output summing node increases
linearly with L. If this noise term becomes dominant, a wider adder is
suggested.

The coefficients of an MIMO filter can be expressed by the coefficients of its
corresponding SISO filter (see equation 2-32). As will be seen later, MIMO and
SISO filters have similar formulations for the output noise power. Hence,
minimizing the noise power for the SISO filter would also minimize the noise
power for MIMO filters. Therefore, the optimum block state filter design becomes
an SISO state filter design.

The roundoff noise analysis in the state space structure has been investigated
by various researchers [39, 40, 41]. The necessary and sufficient conditions
for an optimal state space filter design have been developed. Mullis and Roberts
proposed a block optimal filter structure which has near optimal noise
performance with much less computation. Instead of designing an optimal filter
directly, they independently optimized each section of a parallel and/or cascade
connection of low order subfilters. Jackson et al. [17] mentioned another
structure, which is called sectional optimum. Instead of optimizing the whole
filter in a block form, they optimize each 2nd-order section independently. This
structure requires a very simple design procedure and has noise performance
close to that of the block optimal structure. Above all, this structure can also
have the efficient structure described in chapter 3. Therefore, it is possible
to realize a high speed filter with localized communication while lowering the
roundoff noise.

The sectional optimal filter design problem is reduced to an optimal second
order filter design. Jackson et al. [17] developed a procedure for a minimum
noise second order state space filter design. Barnes [20] introduced another
special structure called normal filters, which have the uniform-grid structure
of Rader and Gold [42] and will not support autonomous overflow limit cycles
[18]. Hence, this structure has a transfer function which is less sensitive to
the coefficient quantization error and the pole positions. This is important for
block state filters, because their poles are more widely spread and closer to
the origin.

In this section, the roundoff noise formulation for both SISO and MIMO filters
will be derived in the state space design. It will be clear that block filters do
improve the roundoff noise behavior, once the noise representation is developed.
The interaction between roundoff noise and dynamic range will be treated in a
later section. The effect of scaling on the roundoff noise will also be discussed.
With a fixed scaling rule, the conditions for an optimal filter design will be given
and the two structures for a low roundoff noise second order state space filter
design mentioned above will be derived.

5.4.1. Noise Formulation for SISO Filters

The state equation for an N-th order digital filter is rewritten as follows

$$x_{n+1} = A x_n + b u_n \tag{5-2a}$$

$$y_n = c x_n + d u_n \tag{5-2b}$$

where A, b, c and d are, respectively, N×N, N×1, 1×N and 1×1 real constant
matrices. Due to the product quantization, the actual filter implemented by a
finite word length machine is

$$\tilde{x}_{n+1} = A\tilde{x}_n + b u_n + \alpha_n \tag{5-3a}$$

$$\tilde{y}_n = c\tilde{x}_n + d u_n + \beta_n \tag{5-3b}$$

where $\alpha_n$ is the noise vector generated from the product quantization of
A and b at time n, and $\beta_n$ is a scalar noise term generated from c and d.
Subtracting (5-2) from (5-3), we have

$$\Delta x_{n+1} = A\,\Delta x_n + \alpha_n \tag{5-4a}$$

$$\Delta y_n = c\,\Delta x_n + \beta_n \tag{5-4b}$$

where $\Delta x_n$ is the state-error vector $\tilde{x}_n - x_n$, and
$\Delta y_n$ is the output noise at time n. The solution to (5-4) is

$$\Delta y_n = c \sum_{i=0}^{n-1} A^{n-i-1}\alpha_i + \beta_n$$

if we assume that $\Delta x_0 = 0$. Under the usual assumptions that the product
quantization errors are white noise and are statistically independent from
source to source and from time to time, the crosscorrelation between the
$\alpha$'s and $\beta$'s is zero. Therefore, the output noise power at time n is

$$\sigma_{y_n}^2 = E(\Delta y_n^2)
 = E\Big[\sum_{i=0}^{n-1}\sum_{j=0}^{n-1} cA^{n-i-1}\alpha_i\alpha_j^T (A^{n-j-1})^T c^T\Big] + E[\beta_n^2] \tag{5-5}$$

Also, since the correlation between $\alpha_i$ and $\alpha_j$ is zero if
$i \ne j$, the above equation can be written as

$$\sigma_{y_n}^2 = \sum_{i=0}^{n-1} cA^{n-i-1}\,E[\alpha_i\alpha_i^T]\,(A^{n-i-1})^T c^T + E[\beta_n^2] \tag{5-6}$$

Since each rounding introduces a noise with expected power $\sigma_0^2$, both
expectations in the above equation equal this noise power multiplied by the
number of roundings associated with the corresponding summing node. Assuming
that $q_i$ products are summed up at the i-th state and $\mu$ products are
summed up at the output, the above equation can be written as

$$\sigma_{y_n}^2 = \Big[\sum_{i=0}^{n-1} cA^i Q (cA^i)^T + \mu\Big]\sigma_0^2 \tag{5-7}$$

where $Q = \mathrm{diag}(q_1, q_2, \ldots, q_N)$.
Usually $\mu$ is a constant and is much smaller than the summation term, and
hence will be ignored in the following analysis. For a balanced state equation
where all states have the same number of products to sum up, the matrix Q in
equation (5-7) can be replaced by a scalar $q = q_i$, $i = 1, 2, \ldots, N$. The
steady state noise power is the noise in (5-7) when $n \to \infty$:

$$\sigma_y^2 = \lim_{n\to\infty}\sigma_{y_n}^2 = \Big[q\sum_{n=0}^{\infty} cA^n (cA^n)^T + \mu\Big]\sigma_0^2 \tag{5-8}$$
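
The steady state expression can be evaluated directly from the state matrices by truncating the infinite sum. The following Python sketch assumes a stable A so that the sum converges; the function names and the default number of terms are illustrative choices, not part of the original analysis.

```python
def noise_gain(A, c, q, terms=2000):
    """Approximate sum_{i>=0} c A^i Q (c A^i)^T with Q = diag(q)."""
    n = len(c)
    row = list(c)                      # row vector c A^i, starting at i = 0
    total = 0.0
    for _ in range(terms):
        total += sum(q[k] * row[k] * row[k] for k in range(n))
        row = [sum(row[k] * A[k][j] for k in range(n)) for j in range(n)]
    return total

def steady_state_noise(A, c, q, mu, sigma0_sq):
    """Recursive-part noise gain plus the output-node term mu, times sigma0_sq."""
    return (noise_gain(A, c, q) + mu) * sigma0_sq

# First-order check: A = [0.5], c = [1] gives 1 + 0.25 + 0.0625 + ... = 4/3.
print(noise_gain([[0.5]], [1.0], [1.0]))
```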


5.4.2.  Block  State  Structures 

Given a properly scaled realization $(\mathbf{A}, \mathbf{B}, \mathbf{C},
\mathbf{D})$ in block state form, the output noise vector can be obtained in a
manner similar to equation (5-8):

$$\boldsymbol{\sigma}_y^2 = \Big[\mathbf{1} + \sum_{n=0}^{\infty}
 \mathrm{diag}\{\mathbf{C}\mathbf{A}^n(\mathbf{C}\mathbf{A}^n)^T\}\Big]\sigma_0^2 \tag{5-9}$$

where $\mathbf{1}$ is a column vector whose elements are all 1's and
diag{ } denotes the column vector of diagonal elements. Assume only one
rounding is performed at each summing node for every input block. If this block
state realization is related to the scalar state realization (A, b, c, d) by
equation (2-32), the i-th noise component can be represented as

$$\sigma_i^2 = \Big(1 + \sum_{n=0}^{\infty} cA^{Ln+i-1}(A^{Ln+i-1})^T c^T\Big)\sigma_0^2,
 \qquad i = 1, 2, \ldots, L \tag{5-10}$$

where L is the block size. If we define the average noise power as

$$\sigma_{av}^2 = \frac{1}{L}\sum_{i=1}^{L}\sigma_i^2 \tag{5-11}$$

it is easily seen that the average noise power is

$$\sigma_{av}^2 = \Big[1 + \frac{1}{L}\sum_{n=0}^{\infty} cA^n (A^n)^T c^T\Big]\sigma_0^2 \tag{5-12}$$

The noise power gain, which is the infinite sum in the above equation, is 1/L of
the noise power gain in the scalar case. It is obvious that the average noise
power involves only the coefficients in its corresponding scalar state filter.
Thus, minimizing the roundoff noise power in the scalar case would also
minimize the roundoff noise power in the block case. In the next section, we are
going to discuss the optimal filter in the scalar case, considering also the
scaling effect.
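
The 1/L relation can be illustrated numerically. The Python sketch below assumes, as stated above, that the i-th output of a block sees the rows c A^(Ln+i-1); the state matrix and output vector are illustrative values, and the helper names are hypothetical.

```python
def row_times_matrix(r, A):
    """Row vector r times square matrix A."""
    n = len(A)
    return [sum(r[k] * A[k][j] for k in range(n)) for j in range(n)]

def scalar_noise_gain(A, c, terms=4000):
    """sum_{k>=0} c A^k (c A^k)^T, truncated after `terms` terms."""
    r, total = list(c), 0.0
    for _ in range(terms):
        total += sum(v * v for v in r)
        r = row_times_matrix(r, A)
    return total

def block_average_noise_gain(A, c, L, blocks=1000):
    """(1/L) * sum_{i=1..L} sum_{n>=0} |c A^(L n + i - 1)|^2, truncated."""
    per_output = [0.0] * L
    r = list(c)
    for k in range(L * blocks):
        per_output[k % L] += sum(v * v for v in r)   # k = L*n + (i - 1)
        r = row_times_matrix(r, A)
    return sum(per_output) / L

# Illustrative scalar filter (assumed values, poles at radius 0.9).
A = [[1.2, -0.81], [1.0, 0.0]]
c = [0.3, 0.2]
for L in (1, 2, 4):
    print(L, block_average_noise_gain(A, c, L), scalar_noise_gain(A, c) / L)
```

Each block output collects a disjoint subset of the terms of the scalar sum, so averaging over the L outputs divides the scalar noise power gain by L.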

5.5. Optimal SISO Filter Synthesis

From equation (5-8), the steady state noise power consists of two terms.
The first one is generated from the recursive operation, whereas the second one
comes from the output summing node. The second term is a constant term and
usually is much smaller than the first one, especially for a narrow band filter.
Furthermore, for a given filter in the state space design, nothing can be done
to reduce this term to improve the overall noise performance. The value of the
first term, however, varies with the choice of states. Hence, the overall noise
minimization is equivalent to the minimization of this term. In the following
analysis, we will take the overall noise to mean the first term only, and will
find criteria for a minimum noise filter synthesis.

Due to the finite register length, the internal nodes of the filter might
overflow if the input samples are too large. The scaling of filter coefficients
is usually necessary in order to keep the registers from overflowing. However,
this scaling usually also changes the behavior of the roundoff noise. Therefore,
the scaling has to be considered before discussing the roundoff noise behavior.
Before discussing the scaling effect, define two symmetric matrices as follows

$$K = AKA^T + bb^T = \sum_{k=0}^{\infty}(A^k b)(A^k b)^T \tag{5-13a}$$

$$W = A^T W A + c^T c = \sum_{k=0}^{\infty}(cA^k)^T(cA^k) \tag{5-13b}$$
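
Both matrices can be approximated by truncating their series. The following Python sketch assumes a stable A; it also returns the residuals of the two fixed-point equations in (5-13) as a consistency check. The filter values and function names are illustrative only.

```python
def gramians(A, b, c, terms=2000):
    """Truncated series for K = sum (A^k b)(A^k b)^T and W = sum (c A^k)^T (c A^k)."""
    n = len(A)
    K = [[0.0] * n for _ in range(n)]
    W = [[0.0] * n for _ in range(n)]
    v = list(b)                        # column vector A^k b
    r = list(c)                        # row vector c A^k
    for _ in range(terms):
        for i in range(n):
            for j in range(n):
                K[i][j] += v[i] * v[j]
                W[i][j] += r[i] * r[j]
        v = [sum(A[i][k] * v[k] for k in range(n)) for i in range(n)]
        r = [sum(r[k] * A[k][j] for k in range(n)) for j in range(n)]
    return K, W

def max_residuals(A, b, c, K, W):
    """Largest entries of K - A K A^T - b b^T and W - A^T W A - c^T c."""
    n = len(A)
    rk = rw = 0.0
    for i in range(n):
        for j in range(n):
            akat = sum(A[i][p] * K[p][q] * A[j][q] for p in range(n) for q in range(n))
            atwa = sum(A[p][i] * W[p][q] * A[q][j] for p in range(n) for q in range(n))
            rk = max(rk, abs(K[i][j] - akat - b[i] * b[j]))
            rw = max(rw, abs(W[i][j] - atwa - c[i] * c[j]))
    return rk, rw

# Illustrative second-order filter (assumed values).
A = [[1.2, -0.81], [1.0, 0.0]]
b = [1.0, 0.0]
c = [0.3, 0.2]
K, W = gramians(A, b, c)
print(K, W, max_residuals(A, b, c, K, W))
```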

Since the summation term in equation (5-6) is a scalar, it can be rewritten as

$$\sigma_y^2 = \lim_{n\to\infty}\sigma_{y_n}^2 = E\sum_{i=0}^{\infty}(cA^i\alpha_i)^T(cA^i\alpha_i)$$

For a given state space structure, the autocorrelation of $\alpha_n$ is
invariant in time, with correlation matrix

$$E[\alpha\alpha^T] = Q\sigma_0^2$$

Hence, the above equation can be rewritten as

$$\sigma_y^2 = E\Big[\alpha^T \sum_{i=0}^{\infty}(cA^i)^T(cA^i)\,\alpha\Big] = E[\alpha^T W \alpha]
 = \mathrm{Tr}(WQ)\,\sigma_0^2 = \sum_{i=1}^{N} q_i W_{ii}\,\sigma_0^2 \tag{5-14}$$

Tr( ) is defined as the sum of the diagonal elements of its matrix argument.

From (5-14), it is obvious that the total noise depends on the diagonal
elements of the matrix W. Now the effect of scaling on these diagonal elements
will be investigated. Before doing that, let us look at the effect of a linear
transformation on this matrix. From the definition of K and W and Theorem 2-1,
it is obvious that any transformation T will map (K, W) to
$(K', W') = (T^{-1}KT^{-T},\; T^T W T)$. It is also easily seen that the diagonal
element $K_{ii}$ is actually the squared $l_2$ norm of the gain from the input
to the i-th state. According to the conventional $l_2$ norm scaling rule, the
gain from the input to each internal summing node should be unity. This is
equivalent to setting all the diagonal elements of matrix K to unity. Hence, the
transformation matrix T should be a diagonal matrix with elements

$$T_{ii} = \sqrt{K_{ii}}$$

This matrix transforms the diagonal elements of W to

$$W'_{ii} = K_{ii}W_{ii}$$

So, after scaling, the output noise power becomes

$$\sigma_y^2 = \sigma_0^2\sum_{i=1}^{N} q_i K_{ii} W_{ii} \tag{5-15}$$

For the case of a full matrix A, $q_i = N+1$ for all i, and hence

$$\sigma_y^2 = (N+1)\,\sigma_0^2\sum_{i=1}^{N} K_{ii}W_{ii} \tag{5-16}$$
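
The effect of the diagonal scaling transformation can be checked with a short Python sketch. It assumes a stable filter with illustrative coefficients, applies T = diag(sqrt(K_ii)) as A' = T^-1 A T, b' = T^-1 b, c' = c T, and recomputes the diagonals of K and W; the function names are hypothetical.

```python
import math

def gram_diagonals(A, b, c, terms=2000):
    """Diagonals K_ii = sum_k (A^k b)_i^2 and W_ii = sum_k (c A^k)_i^2."""
    n = len(A)
    Kd, Wd = [0.0] * n, [0.0] * n
    v, r = list(b), list(c)
    for _ in range(terms):
        for i in range(n):
            Kd[i] += v[i] * v[i]
            Wd[i] += r[i] * r[i]
        v = [sum(A[i][k] * v[k] for k in range(n)) for i in range(n)]
        r = [sum(r[k] * A[k][j] for k in range(n)) for j in range(n)]
    return Kd, Wd

def l2_scale(A, b, c):
    """Apply T = diag(sqrt(K_ii)) so that every K'_ii becomes unity."""
    Kd, _ = gram_diagonals(A, b, c)
    t = [math.sqrt(k) for k in Kd]
    n = len(A)
    As = [[A[i][j] * t[j] / t[i] for j in range(n)] for i in range(n)]
    bs = [b[i] / t[i] for i in range(n)]
    cs = [c[j] * t[j] for j in range(n)]
    return As, bs, cs

# Illustrative filter (assumed values).
A = [[1.2, -0.81], [1.0, 0.0]]
b = [1.0, 0.0]
c = [0.3, 0.2]
Kd, Wd = gram_diagonals(A, b, c)
As, bs, cs = l2_scale(A, b, c)
Kd_s, Wd_s = gram_diagonals(As, bs, cs)
print(Kd_s, Wd_s, [k * w for k, w in zip(Kd, Wd)])
```

After scaling, the diagonal of K is unity and each diagonal element of W equals the product K_ii W_ii of the unscaled realization, which is the quantity that enters (5-15).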

Mullis and Roberts proved in their paper [40] a very important theorem,
which is stated as follows.

Theorem 5-1

Let K and W be two n×n real, symmetric, positive definite matrices. Then

$$\frac{1}{n}\sum_{i=1}^{n} K_{ii}W_{ii} \ge M_a^2 \tag{5-17}$$

where

$$M_a = \frac{1}{n}\sum_{i=1}^{n}\mu_i$$

and the numbers $\{\mu_1^2, \ldots, \mu_n^2\}$ are the eigenvalues of the product
KW. In order for equality to hold, it is necessary and sufficient that

C1: $W = D_0 K D_0$ for some diagonal matrix $D_0$.
C2: $K_{ii}W_{ii} = K_{jj}W_{jj}$ for all i, j.

The minimum output noise variance is therefore

$$\sigma_y^2 = N(N+1)\,M_a^2\,\sigma_0^2$$

and this minimum noise can be obtained when the equality of equation (5-17)
holds or, equivalently, when a diagonal matrix $D_0$ can be found to satisfy
conditions C1 and C2.
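
A small numerical check of this kind of bound is sketched below in Python for the 2×2 case. It assumes the inequality in the form "mean of K_ii W_ii is at least the square of the mean of the mu_i", where the mu_i squared are the eigenvalues of KW; the matrices used are arbitrary positive definite examples, not data from this report.

```python
import math

def bound_terms(K, W):
    """Return (mean of K_ii*W_ii, M_a**2) for 2x2 symmetric positive definite K, W."""
    P = [[sum(K[i][k] * W[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    tr = P[0][0] + P[1][1]
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    # Eigenvalues of KW are real and positive when K and W are positive definite.
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    mu = [math.sqrt((tr + disc) / 2.0), math.sqrt((tr - disc) / 2.0)]
    lhs = (K[0][0] * W[0][0] + K[1][1] * W[1][1]) / 2.0
    m_a = sum(mu) / 2.0
    return lhs, m_a * m_a

print(bound_terms([[2.0, 1.0], [1.0, 3.0]], [[1.0, 0.5], [0.5, 2.0]]))
print(bound_terms([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

In the second call K = W = I satisfies both equality conditions, and the two returned numbers coincide.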

In  the  next  subsection,  a  minimum  noise  second  order  filter  will  be  derived 
according  to  the  above  theorem.  This  filter  design  can  be  used  to  synthesize  a 
high  order  filter  which  is  section  optimum.  The  normal  filter  formulation  will  be 
presented  afterwards.  Although  this  filter  has  slightly  higher  roundoff  noise 
than  the  optimal  filter,  it  has  uniform  sensitivity  to  the  coefficient  truncation 
over  the  entire  z-plane  and  does  not  support  autonomous  limit  cycles. 


5.5.1. Optimal Second Order Filter Synthesis

From Theorem 5-1, the necessary and sufficient conditions for an optimal
filter subject to the $l_2$ norm scaling are

$$W = DKD \tag{5-18a}$$

and

$$K_{ii}W_{ii} = K_{jj}W_{jj} \quad \text{for all } i, j \tag{5-18b}$$

where D is a diagonal matrix. Since the $l_2$ norm scaling requires
$K_{ii} = 1$ for all i, (5-18b) then becomes

$$W_{ii} = W_{jj} \quad \text{for all } i, j$$

Since the diagonal elements of W are equivalent to the norm of the gain from
the corresponding internal summing nodes to the output, the optimal network is
characterized by having equal noise contributions from each error source.

The above conditions are satisfied if, and only if, $D = \rho I$; thus, an
alternate condition which is both necessary and sufficient for optimality is
simply

$$W = \rho^2 K \tag{5-19}$$

In the second order case, (5-19) is not changed by writing it as

$$W = \rho^2 MKM \tag{5-20}$$

where

$$M = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

because both K and W are symmetric matrices with equal diagonal elements.
From (5-13a,b) it is readily seen that (5-20) can be rewritten as

$$\begin{aligned}
A^T W A + c^T c &= \rho^2 M(AKA^T + bb^T)M \\
 &= \rho^2 MAKA^T M + \rho^2 Mbb^T M \\
 &= MAMWMA^T M + M\rho b\,(M\rho b)^T \\
 &= (MAM)\,W\,(MAM)^T + \rho Mb\,(\rho Mb)^T
\end{aligned}$$

Hence, (5-20) is satisfied by a network of the form

$$A^T = MAM$$

and

$$c^T = \rho Mb \tag{5-21}$$

For complex conjugate poles, (5-21) can always be satisfied with real-valued
coefficient matrices.

Suppose the coefficients have the following form

$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \qquad
b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}, \qquad
c = \begin{bmatrix} c_1 & c_2 \end{bmatrix}$$

(5-21) states simply that

$$a_{11} = a_{22}$$

and

$$\frac{b_1}{c_2} = \frac{b_2}{c_1} \tag{5-22}$$


This network can be synthesized from an arbitrary realization (A, b, c, d) as
follows. From (5-21), if the transpose of the optimal network is formed and the
states $x_1$ and $x_2$ are interchanged, the resulting network is identical with
the original optimal network except for scaling. Conversely, if we form the
network $(MA^TM,\; Mc^T/2,\; b^TM,\; d/2)$ and place it in parallel with the
network $(A,\; b/2,\; c,\; d/2)$, an overall network with the above property and
the same transfer function can be produced. Therefore, the optimal network can
be synthesized simply by merging these two parallel networks into a single
network $(\hat{A}, \hat{b}, \hat{c}, d)$ and then scaling according to the $l_2$
norm scaling rule.

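Suppose the filter has a transfer function

$$H(z) = d + \frac{\gamma_2 z^{-2} + \gamma_1 z^{-1}}{\beta_2 z^{-2} + \beta_1 z^{-1} + 1}$$

and is implemented, for example, by

$$A = \begin{bmatrix} -\beta_1 & -\beta_2 \\ 1 & 0 \end{bmatrix}, \qquad
b = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad
c = \begin{bmatrix} \gamma_1 & \gamma_2 \end{bmatrix}$$

The coefficients of $(\hat{A}, \hat{b}, \hat{c}, d)$ are then

$$\hat{a}_{11} = \hat{a}_{22} = -\frac{\beta_1}{2}$$

$$\hat{b}_1 = \tfrac{1}{2}(1+\gamma_2), \qquad \hat{b}_2 = \tfrac{1}{2}\gamma_1$$

$$\hat{c}_1 = \frac{\gamma_1}{1+\gamma_2}, \qquad \hat{c}_2 = 1$$

$$\hat{a}_{12} = \frac{1+\gamma_2}{\gamma_1^2}\Big[\gamma_2 - \tfrac{1}{2}\beta_1\gamma_1
 \pm \sqrt{\gamma_2^2 - \gamma_1\gamma_2\beta_1 + \gamma_1^2\beta_2}\,\Big]$$

$$\hat{a}_{21} = \frac{1}{1+\gamma_2}\Big[\gamma_2 - \tfrac{1}{2}\beta_1\gamma_1
 \mp \sqrt{\gamma_2^2 - \gamma_1\gamma_2\beta_1 + \gamma_1^2\beta_2}\,\Big]$$

The expression under the radical in $\hat{a}_{12}$ and $\hat{a}_{21}$ is always
positive for complex-conjugate poles, and hence the coefficients in the above
equations are all real valued. After scaling, the resulting second order network
is optimal.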
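
The merged section can be checked numerically. The Python sketch below assumes the coefficient expressions read as a11 = a22 = -beta1/2, b = [(1+gamma2)/2, gamma1/2], c = [gamma1/(1+gamma2), 1], with the plus sign of the radical taken in a12 and the minus sign in a21. It compares the impulse response of this state space section with that of the direct difference equation; the numerical filter values are illustrative only.

```python
import math

def optimal_section(g1, g2, b1, b2):
    """Unscaled section for H(z) = d + (g1 z^-1 + g2 z^-2)/(1 + b1 z^-1 + b2 z^-2)."""
    k1 = g2 - 0.5 * b1 * g1
    k2 = math.sqrt(g2 * g2 - g1 * g2 * b1 + g1 * g1 * b2)
    a = -0.5 * b1
    A = [[a, (1.0 + g2) / (g1 * g1) * (k1 + k2)],
         [(k1 - k2) / (1.0 + g2), a]]
    b = [0.5 * (1.0 + g2), 0.5 * g1]
    c = [g1 / (1.0 + g2), 1.0]
    return A, b, c

def impulse_response_ss(A, b, c, n):
    """h[k] = c A^(k-1) b for k = 1..n (the direct term d is excluded)."""
    h, x = [], list(b)
    for _ in range(n):
        h.append(c[0] * x[0] + c[1] * x[1])
        x = [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]
    return h

def impulse_response_tf(g1, g2, b1, b2, n):
    """Same samples from y[k] = g1 u[k-1] + g2 u[k-2] - b1 y[k-1] - b2 y[k-2]."""
    y = [0.0]                                    # y[0] = 0 without the d term
    for k in range(1, n + 1):
        val = (g1 if k == 1 else 0.0) + (g2 if k == 2 else 0.0)
        val -= b1 * y[k - 1]
        if k >= 2:
            val -= b2 * y[k - 2]
        y.append(val)
    return y[1:]

# Illustrative section: poles at radius 0.9 and angle pi/4 (assumed values).
g1, g2, b1, b2 = 1.0, 0.5, -2.0 * 0.9 * math.cos(math.pi / 4.0), 0.81
A, b, c = optimal_section(g1, g2, b1, b2)
err = max(abs(p - q) for p, q in zip(impulse_response_ss(A, b, c, 50),
                                     impulse_response_tf(g1, g2, b1, b2, 50)))
print(err)
```

The section also satisfies a11 = a22 and b1 c1 = b2 c2 by construction, which is the form required by (5-22).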




5.5.2.  Normal  Digital  Filters 

Barnes and Fam suggested a special structure in the state space design,
namely a normal digital filter, which has uniform sensitivity over the entire
z-plane to coefficient quantization and does not support autonomous
overflow limit cycles [18]. Another advantage of this normal structure is that
the noise expression is explicit, whereas Mullis and Roberts and Hwang only
provided a numerical approach for the solution of minimum noise. They also
showed, within this context, a way to minimize the roundoff noise assuming
simple poles only. A realization is called normal if and only if its state
matrix satisfies the following relation

$$AA^* = A^*A$$

where $A^*$ is the conjugate transpose of matrix A.
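
The normality condition is easy to test for a real matrix, where the conjugate transpose reduces to the ordinary transpose. The Python sketch below uses two illustrative 2×2 matrices (assumed values): a scaled rotation, which is normal, and a companion form matrix, which is not.

```python
def matmul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_normal(A, tol=1e-12):
    """True if A A^T equals A^T A (real A, so conjugate transpose = transpose)."""
    n = len(A)
    At = [[A[j][i] for j in range(n)] for i in range(n)]
    left, right = matmul(A, At), matmul(At, A)
    return all(abs(left[i][j] - right[i][j]) <= tol
               for i in range(n) for j in range(n))

# Scaled rotation with eigenvalues 0.6 +/- 0.5j: a normal state matrix.
print(is_normal([[0.6, 0.5], [-0.5, 0.6]]))
# Companion (canonical) form of a second-order section: not normal.
print(is_normal([[1.2, -0.81], [1.0, 0.0]]))
```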

Suppose the filter to be implemented has a transfer function

$$H(z) = d + \alpha_1(z-\lambda_1)^{-1} + \alpha_2(z-\lambda_2)^{-1} \tag{5-23}$$

The filter can always be minimally realized in the state space domain as

$$x_{n+1} = Ax_n + bu_n$$
$$y_n = cx_n + du_n$$

where A is a 2×2 matrix with eigenvalues $\lambda_1$ and $\lambda_2$. It is well
known that normal matrices possess orthonormal eigenvectors $\{\varphi_n\}$,
such that

$$A\varphi_n = \lambda_n\varphi_n, \qquad n = 1, 2$$

$$\varphi_n^*\varphi_m = \delta_{nm}$$

Thus, we can expand A, b and c in the form

$$A = \sum_{n=1}^{2}\lambda_n\varphi_n\varphi_n^*, \qquad
b = \sum_{n=1}^{2}\beta_n\varphi_n, \qquad
c = \sum_{n=1}^{2}\gamma_n\varphi_n^* \tag{5-24}$$

where $A^*$ is the conjugate transpose of the matrix A. In terms of the
expansion coefficients, the resulting system transfer function is given by
where

$$s = \frac{(b_1^2 - b_2^2) - \rho^2\big[(b_1^2 - b_2^2)\cos(2\theta) - 2b_1b_2\sin(2\theta)\big]}
 {1 + \rho^4 - 2\rho^2\cos(2\theta)}$$

In order to satisfy the scaling condition (5-27), we must have

$$s = 0 \tag{5-28a}$$

$$b_1^2 + b_2^2 = 2(1-\rho^2) \tag{5-28b}$$

(5-28a) implies that

$$b_2 = \frac{\rho^2\sin(2\theta) \pm \big[1 + \rho^4 - 2\rho^2\cos(2\theta)\big]^{1/2}}
 {1 - \rho^2\cos(2\theta)}\; b_1 \tag{5-29}$$

(5-29) and (5-28b) determine both $b_1$ and $b_2$. Since the $\varphi_i$'s and
$b_i$'s are known, the values of $\beta_1$ and $\beta_2$ are immediately
available. Substituting the $\beta_i$'s and (5-20) into (5-19), the vector c can
be obtained as follows

$$c_1 = \frac{b_1(\alpha_1 + \alpha_2) + b_2(j\alpha_1 - j\alpha_2)}{2(1-\rho^2)}$$

$$c_2 = \frac{b_1(j\alpha_1 - j\alpha_2) + b_2(\alpha_1 + \alpha_2)}{2(1-\rho^2)}$$

since $c_1$ and $c_2$ are real values.


5.5.3.  Examples 

Jackson, Lindgren and Kim [17] compared the roundoff noise performance
among various structures. This comparison is based on the measurement of the
noise power gain, which is defined as the ratio of the output noise power
generated from the recursive operations to a unit noise power which is defined
as in equation (5-1). For SISO filters, this power gain is defined as the
infinite summation term in equation (5-8). Table 5-1 shows the noise power gain
for five different filters, each of which is implemented in a parallel form. The
second column shows the noise power gain if each parallel section is implemented
by a 2nd-order canonical form. The normal and sectional optimal structures are
shown in columns 3 and 4 respectively. In this parallel design, the block optimal
is the same as the sectional optimal design.


Filter                                   Canonical   Normal   Sect.-Opt.

Chebychev II BRF, N=6                      13.8       11.0      10.9
Chebychev I LPF, N=10                      19.2       14.0      13.8
Elliptic LPF, N=10                         16.9       13.7      13.5
Butterworth LPF, N=8, fc = 0.25 fs         14.2       14.6      13.4
Butterworth LPF, N=8, fc = 0.025 fs        27.0       14.S      13.4

Table 5-1 Noise Power Gain (in dB) for Parallel Form Designs


Table 5-2 shows these five filters with cascade form designs. The block
optimal form is not quite the same as the sectional optimal form and hence is
also tabulated.


Filter 


Chebychev  11 
BRF,  N=6 


Chebychev  1 
LPF,  N=10 


Elliptic 
LPF.  N=10 


Canonical  i  Normal  i 

21.0 

10.5 

29.9 

24.2 

21.5 

16.9 

9.2 

11.0 

19.7 

10.3 

Table  5-2  Noise  Power  Gain  (in  dB)  for  Cascade  Form  Designs 


From  these  two  tables,  the  canonical  form  is  worse  than  all  the  other  forms  in 
almost  all  the  cases.  On  the  other  hand,  normal,  sectional  optimal  and  block 
optimal  are  quite  close  to  one  another. 


5.6. Conclusions


The effects of finite register length have been discussed in this chapter.
These effects influence the complexity of the filter structure and hence are
critical factors in the actual filter implementation. Block state filters are
proven to have better roundoff noise performance than their corresponding scalar
state filters. Due to the similarity in the noise formulation between both cases,
designing a minimum noise scalar state filter automatically minimizes the noise
of the block state filter obtained by using equation (2-32) to transform the
scalar filter into a block filter. This relationship greatly simplifies the
procedure for designing a low noise block state filter. The section optimum
structure can simplify the design process even further, since only the second
order filter need be considered. A filter of any order can be easily synthesized
from the second order filters. Two different approaches for low noise second
order filter design are also derived.

Simulation  results  confirming  the  theoretical  results  here  will  be  given  in 
the  next  chapter. 
CHAPTER 6

SIMULATION  RESULTS 


In this chapter, simulation results will be presented to verify the analysis of
the previous chapters. The speed performance, including the delay effect and
latency, of all the structures will be shown first. The roundoff behavior of the
block state structure will then follow. The test filter is a sixth-order
Butterworth filter, designed using the bilinear transformation, which has the
transfer function given in equation (6-1) [1].

$$H(z) = \frac{0.0007379\,(1+z^{-1})^6}
 {(1 - 1.2686z^{-1} + 0.7051z^{-2})(1 - 1.0105z^{-1} + 0.3583z^{-2})(1 - 0.9044z^{-1} + 0.2155z^{-2})} \tag{6-1}$$

The  programs  are  written  in  'C'  and  run  on  a  VAX 11-730. 
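
For orientation, the test filter can be exercised with a few lines of code. The sketch below is written in Python rather than the C used for the actual simulations, and is not the original program. It assumes the coefficients of (6-1) read as three second-order denominators (1 - 1.2686 z^-1 + 0.7051 z^-2), (1 - 1.0105 z^-1 + 0.3583 z^-2), (1 - 0.9044 z^-1 + 0.2155 z^-2) with gain 0.0007379, and it pairs each denominator with one factor (1 + z^-1)^2, which is an arbitrary choice.

```python
# Denominator coefficients (a1, a2) of each section and the overall gain,
# as read from equation (6-1); treat the digits as assumptions.
SECTIONS = [(-1.2686, 0.7051), (-1.0105, 0.3583), (-0.9044, 0.2155)]
GAIN = 0.0007379

def biquad(x, a1, a2):
    """Filter x through (1 + z^-1)^2 / (1 + a1 z^-1 + a2 z^-2)."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = xn + 2.0 * x1 + x2 - a1 * y1 - a2 * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y

def butterworth6(x):
    """Cascade of the three sections, with the gain applied at the input."""
    y = [GAIN * v for v in x]
    for a1, a2 in SECTIONS:
        y = biquad(y, a1, a2)
    return y

# Step response: a low-pass filter with unity DC gain should settle near 1.
step = butterworth6([1.0] * 400)
print(step[-1])
```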

6.1.  Speed  Simulation 

A special C compiler, 'simcc', is provided to work cooperatively with SIMON* to
obtain the execution time of each task. SIMON is a simulation program which
executes the application programs as if they were running on a multiprocessing
system. To ensure that no PE receives any future data, which might be available
to a simulator running on a single processor, SIMON has a global clock to
keep track of the elapsed time of each PE. 'simcc' inserts instructions into the
assembly language routine to accumulate the execution time of each instruction.
If the programs of a process are compiled with the regular compiler 'cc',
no execution time is counted. Therefore, this process can run virtually
infinitely fast [2]. This is done for the simulation overhead, such as reading
and writing data from and to the files. If the reading and writing routines are
compiled with the regular compiler, the speed limitation of the computer I/O
will not affect the actual filter speed simulation, since the input and output
data are transmitted infinitely fast. In the following simulations, the
execution time of every instruction is assumed to be 1 μs.

In this chapter, the simulation results for the speed performance and the
effect of transmission delay on both Barnwell's and the block state structures
will be shown. The cascade/parallel form and scalar systolic array approaches
are straightforward. Their structures are fixed for a given filter equation;
hence, it is impossible to increase the throughput rate by changing the number
of PE's. Block I/O filters are similar to block state filters but are not as
attractive as far as communication complexity is concerned. Therefore, these
structures will not be discussed here.

6.1.1. SSIMD Filter

In this section, we show the speed performance of Barnwell's algorithm
applied to the filter equation above with a varying number of PE's. The program
structure is shown in Figure 6-1 for the case of 10 PE's. Processor 1 sends out
512 sample points and PE 12 collects all the output samples from PE's 2 to 11
and then puts them into the right order. Programs in PE's 1 and 12 are compiled
with the regular compiler; hence, they work as if they can finish the whole
function in no time at all. Thus, the overall speed is limited by the filter
function but not by the data input and output. The overall execution time is
obtained by recording the arrival times of the first and the last sample in
PE 12 and taking their difference.

Figure 6-2 shows the speedup ratio vs. the number of PE's. This ratio is
obtained by comparing the speed to that of only one PE. The straight line
represents the ideal case where the speedup ratio increases linearly with the
number of PE's with a unity slope. When the number of PE's is below 10, the
speedup ratio increases linearly but with a lower slope. From 10 to 11, the rate
of increase in speed slows down, and the speedup saturates after 11. Therefore,
10 PE's are PE optimum and 11 is speed optimum, as mentioned in chapter 4. This
figure verifies the summarized results stated in section 4-1.

Figure 6-1 Program Structure of an SSIMD Filter with 10 PE's

Figure 6-3 shows the effect of transmission delay on the speed performance.
The results are obtained by using 10 PE's whose structure is shown in Figure 6-1.
The speed with null transmission delay is used as the reference and set to unity.
A constant delay for each message transmission is simulated for the range from
1 μs to 50 μs. The speedup drops below 0.2 when the delay is 50 μs. More
detailed simulation results can be found in R. Fujimoto's dissertation [43],
which considers various types of interconnections as well as a finite bandwidth
assigned to each communication link. The effect of the multi-cast capability of
each transmission link is also investigated there.

6.1.2. Block State Filters

The filter described by equation (6-1) is simulated in the block state form
with various block sizes. Figure 6-4 shows the program structure of the filter
with block size L=6. Lines and arrows represent the actual data flow in the
program. The floating lines and arrows are fed into a dummy data sink, which
does not have any function and does not take any time. These floating lines and
the dummy data sink are necessary to make all the cells similar, so that only a
single routine is required for the whole simulation. In the actual
implementation, they may not be necessary. Task 1 reads input samples from an
input file and then distributes them to several other tasks. It also sends '0'
to all the tasks which need it. Task 29 takes the final data from the output of
matrix C and then writes them into an output file.

Tasks 2 to 10 perform the matrix multiplication of matrix B. Each task
performs a two by two matrix multiplication. The y output travels horizontally
to matrix A and the x input travels downward to the data sink. Tasks 11, 12 and
13 perform the recursive operation and then add the results to the inputs from
B. Their outputs are sent to matrix C as the input vector. The initial y inputs
come from the output of matrix D and the output samples travel vertically. Since
matrix D is lower triangular, only six tasks are required instead of nine.
Furthermore, the tasks on the diagonal, which are 23, 23 and 23, contain only
three non-zero coefficients.
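
A single cell of the matrix arrays might be sketched as below. This is a sketch under stated assumptions rather than the routine used in the simulation: it assumes that each cell multiplies a two by two coefficient block by a two-element x vector, accumulates the product into the partial sum y that travels horizontally, and forwards x unchanged. The names `vec2`, `mat2` and `cell_step` are hypothetical.

```c
#include <assert.h>

typedef struct { double v[2]; } vec2;     /* two-element data vector      */
typedef struct { double m[2][2]; } mat2;  /* 2x2 block of coefficients    */

/* One cell step: add the product coef * x_in to the partial sum y_in
   (moving horizontally) and forward x_in unchanged (moving downward). */
void cell_step(const mat2 *coef, vec2 x_in, vec2 y_in,
               vec2 *x_out, vec2 *y_out)
{
    assert(coef && x_out && y_out);
    for (int r = 0; r < 2; r++)
        y_out->v[r] = y_in.v[r]
                    + coef->m[r][0] * x_in.v[0]
                    + coef->m[r][1] * x_in.v[1];
    *x_out = x_in;
}
```

Because every cell runs the same step, one routine can serve all the cells of a matrix, which is the reason the dummy data sink is introduced for the unused outputs.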

Only two routines are written for all the tasks. The functions in tasks 11, 12
and 13 are slightly different from the others, and hence their program is
separate. Since only one program is written for all the tasks in matrices B, C
and D, they must have exactly the same operation and I/O connections. SIMON
does not allow floating links: each export FIFO must have a corresponding
import FIFO. Therefore, a sink is required to receive data from the floating
output ports in order to make SIMON work properly.

Figure 6-5 shows the speedup ratio of this filter structure with block sizes
from 2 to 10. The relative speed is measured against the speed of a single-PE
Barnwell structure. The speedup ratio is plotted against the block size. The
corresponding number of PE's for each block size is also shown. It is obvious
that this structure is not as efficient as Barnwell's structure. However, this
structure does not saturate in speed when the number of PE's increases. If the
block size increases further, the speedup ratio curve continues to rise with the
same slope without any limit. Hence, it is verified that block state filters can
achieve any desired speed.

The effect of the transmission delay for block state filters was also
simulated for the case of a constant delay on each link. Such a delay is less
likely to occur than in the Barnwell structure because of the local
interconnect. The simulation is done with block size 10, and the speed with null
transmission delay is used as the reference. The overall speed drops slightly
when the delay increases to 50 μs (see Figure 6-5). This speed drop is due to
the fact that the program for the recursive operations is different from the
program for the rest of the cells. Hence, the execution time is not the same for
all the cells. Since SIMON is an event driven simulator, and there is no global
clock to drive the function of each cell, the uneven execution time might come
into effect when the delay is large. In the actual implementation, if a global
clock is used, all the cells have the same execution time, and the overall speed
will not drop with high transmission delays.

6.2.  Roundoff  Noise  Simulation 

Due to its random behavior, roundoff noise is not easy to simulate on a
digital computer directly. However, with a little approximation, fairly
reasonable results can be obtained. We will describe in this section our
approach to the simulation and compare the results with the theoretical ones
obtainable from equation (5-9).

Excited  by  a  perfect  sine  wave,  a  filter  should  generate,  at  its  output,  a  sine 
wave  of  the  same  frequency  plus  some  roundoff  noise.  Observing  this  output 
waveform  for  a  long  enough  time,  we  should  be  able  to  measure  the  output  noise 
power  by  looking  at  the  output  spectrum.  It  is  simply  the  overall  output  power 
minus  the  power  of  the  output  sine  wave. 

Before applying the above technique, however, two problems have to be
resolved. First, a perfect sine wave is not obtainable on a digital computer.
The input signal has to be quantized to fit in finite length registers or
memories. Therefore, the output noise contains filter roundoff noise as well as
the input quantization noise passing through the filter. Since the input
precision is usually close to that of the internal samples in most applications,
these two noise components are too close to be distinguished from each other
simply by looking at the output spectrum. One way to remedy this problem is to
assign more bits to the input samples than to the internal and output data. This
gives us a nearly perfect sine wave, and hence the input error has very little
effect on the output noise.

Second, in order to have good results, the output noise has to be stationary.
For the block state filter, the output noise is periodically stationary with a
period equal to the block size. Hence, it is more reasonable to measure the
noise power separately for each component in a block. The noise power of the
i-th sample is measured by observing the spectrum of the sequence formed by the
i-th samples of all the blocks. In order to ensure that the decimated output
signal is a sinusoid, a higher sampling rate is required for the input signal.
Furthermore, the number of different input values has to be large enough that
the filter experiences rounding at all different signal levels. This has to be
true so as to satisfy the assumption that the noise is uncorrelated from sample
to sample.
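
Forming the sequence of i-th samples is a simple decimation of the output. A minimal sketch in C follows; the helper name `extract_phase` and the zero-based index i are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the i-th sample of every block (i = 0 .. L-1) from the full output
   sequence y[0 .. num_blocks*L - 1] into sub[0 .. num_blocks - 1].
   Hypothetical helper, written only to illustrate the decimation. */
void extract_phase(const double *y, size_t num_blocks, size_t L,
                   size_t i, double *sub)
{
    assert(i < L);
    for (size_t k = 0; k < num_blocks; k++)
        sub[k] = y[k * L + i];
}
```

Each of the L sub-sequences obtained this way is then analyzed separately, which yields one noise power figure per position in the block.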

With all the above restrictions in mind, we will introduce some mathematical
basis to perform the simulation and then show the actual structure of the
simulation. Finally, the simulation results of the test filter for various block
sizes will be shown. The filter coefficients are obtained by grouping second
order normal digital filters described in section 5.5.2.

6.2.1.  Mathematical  Formulation 

Duttweiler  and  Messerschmitt[44]  have  shown  a  method  to  test  A/D  and  D/A 
converters  with  digitally  generated  sinusoids.  Their  algorithm  is  also  suitable 
for  the  roundoff  noise  measurement. 

6.2.1.1. Structure

Suppose the input sequence is represented as

$$x_k = A \sin(2\pi f_a k T) \tag{6-2}$$

where $f_a$ is the frequency of this sine wave and $T$ is the sampling period.
The output sequence $\{y_k\}$ is also a sampled sine wave with frequency $f_a$,
but with a different amplitude and phase, plus a noise term:

$$y_k = B \sin(2\pi f_a k T + \phi) + n_k \tag{6-3}$$

Suppose $N$ samples of a sine wave are observed. These samples must contain an
integer number of cycles in order to have a single line spectrum at the desired
frequency. This is equivalent to

$$f_a N T = M$$

for some integer $M$. Hence, the available frequencies are of the form

$$f_a = \frac{M}{NT} \tag{6-4}$$

Substituting (6-4) into (6-2) gives

$$x_k = A \sin(2\pi M k / N)$$

Furthermore, in order to have as many different values as possible, it is
desired that there is only one cycle in these $N$ samples. This is equivalent to
constraining the two integers $N$ and $M$ to be relatively prime. Another
constraint is that $f_a$ has to be less than $\frac{1}{2T}$ so as to satisfy the
Nyquist sampling criterion. From equation (6-4),

$$\frac{M}{NT} < \frac{1}{2T}$$

or

$$M < \frac{N}{2}$$
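
The two constraints on M and N can be checked before a test tone is generated. The following C sketch is illustrative only; `valid_test_tone` and `tone_phase_index` are hypothetical names. The second function returns the index (M k mod N) that selects the phase of sample k, so that the sample itself would be A sin(2π · index / N).

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm. */
static unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Returns 1 if M cycles in N samples give a usable test tone:
   M and N relatively prime (N distinct phases) and M < N/2 (Nyquist). */
int valid_test_tone(unsigned M, unsigned N)
{
    if (M == 0 || N == 0)
        return 0;
    return gcd(M, N) == 1 && 2 * M < N;
}

/* Phase index of sample k; the sample value is A*sin(2*pi*index/N). */
unsigned tone_phase_index(unsigned M, unsigned N, unsigned k)
{
    assert(valid_test_tone(M, N));
    return (unsigned)(((unsigned long long)M * k) % N);
}
```

When M and N are relatively prime, the indices produced for k = 0 to N-1 are all different, so the filter input takes N distinct values.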


6.2.1.2. Noise Power Measurement

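Taking the DFT of the output sequence $y_k$ gives

$$Y_n = \sum_{k=0}^{N-1} y_k \, e^{-j 2\pi n k / N}$$

The total power at all frequencies except the power at the sine frequency and dc
is naturally called noise power (NP). Hence,

$$NP = \frac{1}{N^2} \sum_{n \neq 0,\, M,\, N-M} |Y_n|^2 \tag{6-5}$$

According to Parseval's theorem,

$$\sum_{k=0}^{N-1} y_k^2 = \frac{1}{N} \sum_{n=0}^{N-1} |Y_n|^2$$

Substituting this equation into (6-5), and noting that $|Y_{N-M}| = |Y_M|$ for
a real sequence, we get

$$NP = \frac{1}{N} \sum_{k=0}^{N-1} y_k^2 - \frac{1}{N^2} \left( |Y_0|^2 + 2\,|Y_M|^2 \right) \tag{6-6}$$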
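
The subtraction of the dc and sine-wave power from the total power can be sketched in C as below. The sketch assumes an unnormalized DFT, so that Parseval's theorem reads sum(y²) = (1/N) sum(|Y_n|²), and a noise power normalized as a mean-square value; these normalizations are assumptions of the sketch. The squared magnitudes at dc and at the tone bin are assumed to come from a separate FFT routine, and the name `noise_power` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Noise power from N real output samples y[k] and the squared DFT
   magnitudes at dc (p0 = |Y_0|^2) and at the tone bin (pm = |Y_M|^2).
   Assumes an unnormalized DFT and 0 < M < N/2, so the tone occupies
   bins M and N-M with equal magnitude. */
double noise_power(const double *y, size_t N, double p0, double pm)
{
    assert(N > 0);
    double total = 0.0;
    for (size_t k = 0; k < N; k++)
        total += y[k] * y[k];            /* time-domain energy           */
    double n = (double)N;
    return total / n - (p0 + 2.0 * pm) / (n * n);
}
```

Only two DFT bins are needed in this form; the remaining bins are accounted for through the time-domain energy.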


For a block state filter with block size $L$, we have to modify equation (6-2) as

$$x_k = A \sin(2\pi M k / NL)$$

where $M$ and $NL$ should be relatively prime. For the i-th term in the block,
the output sequence looks like

$$y_{kL+i} = B \sin\left(\frac{2\pi M (kL+i)}{NL} + \phi\right) + n_{kL+i}$$

$$\phantom{y_{kL+i}} = B \sin\left(\frac{2\pi M k}{N} + \frac{2\pi M i}{NL} + \phi\right) + n_{kL+i}$$

$$\phantom{y_{kL+i}} = B \sin\left(\frac{2\pi M k}{N} + \phi_i\right) + n_{kL+i} \tag{6-7}$$

where $\phi_i = \frac{2\pi M i}{NL} + \phi$. The phase angle $\phi_i$ is constant
for the i-th term, and hence will not contribute to the output power
measurement. Therefore, equation (6-6) can still be used to calculate the i-th
noise power in a block, but where the frequency is deemed to be $\frac{M}{N}$
rather than $\frac{M}{NL}$.


6.2.2.  Simulation  Routines 


The high precision of the input samples makes the simulation a little
complicated, since it may happen that no existing fixed point variable can hold
the even higher precision of the products of the input samples and the stored
coefficients. Hence, a special variable has to be created in order to perform
the high precision multiplication. Fortunately, the "C" programming language
provides an excellent tool, called struct, to solve this problem. Programs to
perform the high precision multiplication and addition then have to be written.
Finally, a rounding routine is required. Once these routines are done, the noise
simulation is straightforward in conjunction with the filter and the FFT
routines.

6.2.2.1. High Precision Multiplier and Adder

To generate a near perfect input sine wave, more bits must be assigned to
the input samples than to the filter coefficients and output samples. In the
simulation program, 32 bits are assigned to the input samples and 16 bits to the
coefficients and output data. The problem arising from this high precision is
that the fixed point multiplier requires 47 bits to hold the precision of its
output. This much higher precision is impossible to simulate on a VAX machine
directly. Therefore, subroutines are written to emulate the long number
multiplication.

The output from the multiplier is stored in an array of three 16-bit short
integers, which are grouped into a "struct" called long_nu.

    typedef struct {
        short t[3];    /* three 16-bit words; t[0] is the least significant */
    } long_nu;

Due  to  the  precision  limitation  on  VAX  machines,  the  input  samples  cannot  be 
multiplied  by  the  filter  coefficients  directly.  Therefore,  the  input  samples  are 
divided  into  two  16-bit  short  integers  and  then  multiplied  by  the  coefficient.  The 
two  products  are  summed  up  after  shifting. 

Suppose the input sample $x$ is divided into $x_1$ and $x_2$, where $x_1$
contains the leftmost 16 bits and $x_2$ contains the rest. It is obvious that
the most significant bit of $x_2$ is no longer a sign bit. Let us store these
two 16-bit numbers in two 32-bit integers $x_1'$ and $x_2'$ respectively. The
left 16 bits of $x_2'$ are all 0's and the left 16 bits of $x_1'$ are the same
as the Most Significant Bit (MSB) of $x_1$ (this is called sign extension). Then

$$x = 2^{16} x_1' + x_2'$$

Suppose the coefficient is "a" and the product is $y = ax$. Since $a$ is also a
short integer, we can obtain $y$ as

$$y = 2^{16} a x_1' + a x_2'$$

This is equivalent to shifting the product $a x_1'$ 16 bits to the left and then
adding it to $a x_2'$.
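
The split multiplication can be sketched in C as follows. This is a minimal sketch, not the original subroutine: the type `wide48` and the functions `split16`, `mul32x16` and `wide48_value` are hypothetical, unsigned 16-bit words are used instead of `short` so that the lower words carry no sign, and the 64-bit helper is included only to check the result.

```c
#include <assert.h>
#include <stdint.h>

/* Three 16-bit words, t[0] least significant, holding a 48-bit
   two's-complement number. Unsigned words are used here so that the
   lower words carry no sign; this differs from the `short` in the text. */
typedef struct { uint16_t t[3]; } wide48;

/* Split v into hi * 65536 + lo with 0 <= lo <= 65535. */
static void split16(int32_t v, int32_t *hi, uint16_t *lo)
{
    *lo = (uint16_t)((uint32_t)v & 0xFFFFu);
    *hi = (v - (int32_t)*lo) / 65536;   /* exact: v - lo is a multiple of 65536 */
}

/* y = a * x computed from two 16-bit by 16-bit partial products. */
wide48 mul32x16(int32_t x, int16_t a)
{
    int32_t x1, p2_hi, s_hi;
    uint16_t x2, p2_lo, s_lo;
    wide48 y;

    split16(x, &x1, &x2);                   /* x = 2^16 * x1 + x2        */
    int32_t p1 = (int32_t)a * x1;           /* product with signed half  */
    int32_t p2 = (int32_t)a * (int32_t)x2;  /* product with unsigned half */

    split16(p2, &p2_hi, &p2_lo);
    split16(p1 + p2_hi, &s_hi, &s_lo);      /* shift p1 left 16 bits, add p2 */
    assert(s_hi >= -32768 && s_hi <= 32767);

    y.t[0] = p2_lo;
    y.t[1] = s_lo;
    y.t[2] = (uint16_t)((uint32_t)s_hi & 0xFFFFu);
    return y;
}

/* Reference value of a wide48, for checking only (uses 64-bit arithmetic). */
int64_t wide48_value(wide48 w)
{
    int64_t top = w.t[2] >= 0x8000u ? (int64_t)w.t[2] - 65536 : (int64_t)w.t[2];
    return top * 4294967296LL + (int64_t)w.t[1] * 65536 + (int64_t)w.t[0];
}
```

Only 32-bit arithmetic is used inside `mul32x16`, which mirrors the restriction that motivated the subroutines.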

As for the addition of two numbers, care must be taken when adding two
products of different lengths. The product of the recursive state variables and
the coefficients has only 31 bits instead of 47 bits. When adding a 47 bit
number to a 31 bit number, the most significant bits must be lined up rather
than the least significant bits. This can be done by storing the 31 bit integer
in the left two short integers, t[1] and t[2], of long_nu and setting t[0] to 0.
As in multiplication, the leading bits of the right two short integers are no
longer sign bits. When added, these two bits must be treated as regular bits
rather than sign bits.
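
The aligned addition might look like the following C sketch. The names `wide48`, `align_product32` and `add48` are hypothetical, unsigned words are an assumption of the sketch, and the shorter product is held in a 32-bit integer.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t t[3]; } wide48;   /* t[0] least significant */

/* Place a 32-bit two's-complement product in the upper two words so that
   its MSB lines up with the MSB of a 48-bit product; t[0] is set to 0. */
wide48 align_product32(int32_t q)
{
    uint32_t u = (uint32_t)q;
    wide48 w;
    w.t[0] = 0;
    w.t[1] = (uint16_t)(u & 0xFFFFu);
    w.t[2] = (uint16_t)(u >> 16);
    return w;
}

/* 48-bit addition, word by word; the lower words are treated as plain
   unsigned bits and only the carries link them (result is modulo 2^48). */
wide48 add48(wide48 a, wide48 b)
{
    uint32_t s0 = (uint32_t)a.t[0] + b.t[0];
    uint32_t s1 = (uint32_t)a.t[1] + b.t[1] + (s0 >> 16);
    uint32_t s2 = (uint32_t)a.t[2] + b.t[2] + (s1 >> 16);
    assert((s0 >> 16) <= 1u && (s1 >> 16) <= 1u);   /* carries are single bits */
    wide48 r;
    r.t[0] = (uint16_t)(s0 & 0xFFFFu);
    r.t[1] = (uint16_t)(s1 & 0xFFFFu);
    r.t[2] = (uint16_t)(s2 & 0xFFFFu);
    return r;
}

/* Reference value of a wide48, for checking only (uses 64-bit arithmetic). */
int64_t wide48_value(wide48 w)
{
    int64_t top = w.t[2] >= 0x8000u ? (int64_t)w.t[2] - 65536 : (int64_t)w.t[2];
    return top * 4294967296LL + (int64_t)w.t[1] * 65536 + (int64_t)w.t[0];
}
```

Treating the lower words as unsigned is what keeps their leading bits from being misread as sign bits during the carry propagation.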

6.2.2.2. Rounding Routine

Rounding can be done by a simple truncation after adding a bit "1" at the
proper position. If the leftmost 16 bits are to be preserved after rounding, the
bit "1" should be added to the 17th bit from the left. It is always added to the
bit next to the Least Significant Bit (LSB) of the rounded number. Another thing
that has to be taken care of is the positioning of the decimal point. Since the
coefficients might range from a very small number to a very large one, the
designer might have to position the decimal point so as not to overflow the
largest number. When performing the rounding, this decimal point also has to be
considered in order to obtain the correct result; otherwise the state variables
in the next cycle will no longer be valid.
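
Rounding to the leftmost 16 bits can be sketched in C as below. The sketch assumes a three-word, 48-bit number with unsigned words and ignores the placement of the decimal point and overflow handling, both of which a real routine would have to address. The names `wide48` and `round_to_top16` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t t[3]; } wide48;   /* t[0] least significant */

/* Round a 48-bit two's-complement number to its leftmost 16 bits:
   add a "1" at the 17th bit from the left, then truncate. */
int32_t round_to_top16(wide48 w)
{
    uint32_t s1 = (uint32_t)w.t[1] + 0x8000u;     /* rounding bit          */
    uint32_t s2 = (uint32_t)w.t[2] + (s1 >> 16);  /* carry into top word   */
    uint16_t top = (uint16_t)(s2 & 0xFFFFu);      /* truncation            */

    /* No saturation is implemented: rounding up the largest positive
       top word would wrap, so that case is rejected here. */
    assert(!(w.t[2] == 0x7FFFu && (s1 >> 16) == 1u));

    return top >= 0x8000u ? (int32_t)top - 65536 : (int32_t)top;
}
```

Adding the bit and truncating rounds halves upward, so 1.5 becomes 2 and -1.5 becomes -1 in units of the retained LSB.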

6.2.3.  Simulation  Results 

Table 6-1 shows the output roundoff noise power of the filter characterized
by equation (6-1) with various block sizes. The noise powers are obtained by
evaluating equation (5-9). All the coefficients have to be truncated to fit in
16-bit registers before being plugged into (5-9). On the other hand, in order to
maintain high precision during calculation, the truncated numbers are expressed
as 64-bit floating point numbers. Table 6-2 shows the roundoff noise obtained by
the simulation technique described in the previous sections. It is easily seen
that the simulation results are close to those from the calculation.

The simulation results also verify that the noise performance improves as the
block size increases. It is also apparent that the noise power of the first
sample is larger than that of any other component in the block. The simulated
roundoff noise of a cascade form filter is shown at the end of Table 6-2. It is
clear that this noise power is close to that of the state filter with block
size 1. However, it is impossible to modify this SISO form to improve the
roundoff noise performance. Hence, block state filters are better structures
considering the roundoff noise behavior.


Roundoff Noise Calculation

Block Size  Sample 1  Sample 2  Sample 3  Sample 4  Sample 5  Sample 6  Sample 7
1           0.6913
2           0.5423    0.2C.24
3           0.5073    0.2163    0.1349
4           0.4950    0.2099    0.1311    0.1053
5           0.4S97    0.2063    0.1291    0.1045    0.0951
6           0.4370    0.2051    0.1230    0.1037    0.0545    0.0903
7           0.4354    0.2041    0.1272    0.1032    0.0542    0.0901    0.0877

Table 6-1 Roundoff noise of block state filters obtained from calculation


Roundoff Noise Simulation

Block Size  Sample 1  Sample 2  Sample 3  Sample 4  Sample 5  Sample 6  Sample 7
1           0.7022
2           0.5461    0.2257
3           0.5201    0.2165    0.1331
4           0.5023    0.2143    0.1293    0.1116
5           0.4975    C.20S9    0.1217    0.1CB0    0.0317
6           0.4906    0.1963    0.1232    0.1003    0.0355
7           0.4923    0.2220    0.1319    0.1CS0    0.0332              0.0377

Cascade Form: 0.6979

Table 6-2 Roundoff noise of block state filters obtained from simulation


6.3. Conclusions

The speed performance of the Barnwell filter as well as the block state filter
was simulated for various numbers of PE's and block sizes. The results verified
the analyses in Chapter 4 for both structures. The speed of the Barnwell filter
saturates if the number of PE's used is beyond some value, which is 11 in our
specific simulation example. When the transmission delay comes into effect, the
speedup ratio of this structure drops drastically with delay time. This is
exactly what we expected in Chapter 4. As for the block state filter, the
speedup ratio does not saturate with large block size or, equivalently, more
PE's. Furthermore, the transmission delay has very little effect on the overall
speed. The unlimited speedup ratio and the insensitivity to the transmission
delay make this structure far better than the Barnwell structure in VLSI
applications.

The roundoff noise simulation of block state filters also verified the
derivation in the previous chapter. The improvement in noise performance with
increased block size can be easily seen from the simulation results. The scalar
state filter structure was also verified to be equivalent to the cascade form
structure. As the block size increases, the block state structure will
outperform the cascade form.


CHAPTER 7
CONCLUSIONS


Several algorithms and structures for implementing digital filters with a
high degree of parallelism have been presented. The motivation behind the use
of a parallel or pipelining technique to increase the sampling rate is the
desire to implement the filter in VLSI circuits. VLSI can provide much more
computational hardware in a single chip, but without commensurate increases in
speed. Parallel algorithms can effectively increase the throughput rate by
utilizing the high density of VLSI circuits. Actually, the effectiveness of the
speed performance improvement is beyond what we expected, since the sampling
rate can be increased indefinitely by adding hardware. The sampling rate is thus
limited only by the die area, and not by the speed of the hardware.

The block processing of the input samples is proposed for the realization of
both FIR and IIR filters. Block processing has been shown to be effective in
increasing the internal parallelism of a given filter. Together with the
systolic idea, block processing can achieve very high speed with only local
interconnection among PE's. This is straightforward for FIR filter realization.
For IIR filters, state equations are employed to model the filtering function.
With state equations, the feedback computation has a fixed rate for a given
filter and also can be decomposed into smaller chunks, which are of fixed size
and are independent of the filter order. The feedforward operation, on the other
hand, can be realized by two dimensional arrays. Block state filters break the
limitation on the highest speed imposed by the recursive operation. Higher speed
can be achieved by a larger block size and a simple expansion of the overall
structure without complicating the communication environment.

IIR filters are known to require much less computation than FIR structures
to achieve a given frequency response. If linear phase is not essential, IIR
filters are preferred in order to save hardware. However, as the block size in
an IIR filter increases to process signals with a higher sampling rate, the
average computation per output sample also increases. On the other hand, the
average computation stays the same as the block size increases for FIR filter
structures. Since the average computation per output sample is a good indicator
of the hardware size required, FIR filters should be used for very high speed
filters. It is shown in Chapter 3 that FIR filters can also be realized with
only local interconnections.

Another  factor  that  will  affect  the  filter  performance  as  well  as  the 
hardware  size  is  the  finite  word  length  effect.  The  block  state  filter  is  shown  to 
have  very  low  roundoff  noise.  Further,  the  roundoff  noise  performance  improves 
as  the  block  size  increases.  Hence,  although  the  average  computation 
increases,  the  roundoff  noise  is  lower  for  higher  speed  processing.  Another 
effect  is  that  the  poles  move  toward  the  origin  as  the  block  size  increases. 
Therefore,  the  filter  is  less  likely  to  become  unstable  when  quantizing  the  filter 
coefficients. 

In Chapter 3, the applications of the two dimensional systolic array and the
block state filters to computer graphics and to decimation as well as
interpolation are illustrated. However, the problem of the coefficient update
for various rotations, scalings and translations in computer graphics is left
open. Further research is required before the systolic approach can be applied
to the computer graphics area. Furthermore, the two applications described by
no means exhaust all the possibilities. Further research is necessary to find
more applications.

Due to advances in IC technology, adaptive algorithms are becoming more
economical for sophisticated signal processing systems. Therefore, more
adaptive or time-varying filters are used to achieve higher performance or to
reduce the bit rate. These filters, however, are difficult to process in
parallel, since the filter coefficients change with time. Furthermore, the
output samples are usually fed back to adapt the filter coefficients, and this
adaptation has to be completed in a fixed time interval. Further effort in
applying parallelism to adaptive signal processing is therefore needed.


Advances in microprocessor technology will soon make general purpose
computing systems composed of thousands of VLSI processors economically
feasible. A high-performance communication system to interconnect these
processors is of crucial importance to exploit the parallelism inherent in
applications such as circuit simulation and signal processing. This thesis
discusses issues in the design of universal VLSI communication components to be
used as the building blocks for constructing robust, high-bandwidth,
point-to-point networks. The components provide enough flexibility to serve a
wide variety of multicomputer configurations and applications. They feature
special purpose hardware to implement communication functions traditionally
implemented with network software.

A communication network constructed from the proposed components is
modeled as a set of nodes (components) connected by bidirectional communication
links. Because of technological constraints, the total I/O bandwidth of each
node is limited to some fixed value, and assumed to be equally divided among
the attached links. Increasing the number of links per component leads to a
reduction in the average number of hops between nodes, but at the cost of
reduced link bandwidth. This "hop count / link bandwidth" tradeoff is examined
in great detail through M/M/1 queueing models and simulations using traffic
loads generated by parallel application programs. These results indicate that a
small number of links should be used. It is also found that a significant
improvement in performance is obtained if a component is allowed to immediately
begin forwarding a message when the selected output link becomes idle,
regardless of whether or not the end of the message has arrived. Finally,
mechanisms which efficiently transmit a single message to multiple destinations
are seen to have a significant impact on performance in programs relying on
global information.

The complexity of the circuitry required to implement a communication
component is examined. Schemes for providing hardware support for communication
functions (routing, buffer management, and flow control) are presented.
Estimates of the number of buffers and the degree of multiplexing on each
communication link are determined. The amount of circuitry to implement a
communication component is computed, and it is seen that the proposed
communication component could be implemented with technology available today.
Design recommendations for the implementation of such a component are made.


CHAPTER ONE
INTRODUCTION


The processing power of a general purpose computing system can be
increased in two ways. One approach, which has the advantage that old software
can be re-used, is to increase the speed of an existing computer system by
technological means without altering the basic organization of hardware
components. Much of this effort focuses on the development of very high speed
electrical circuits through the use of new materials, e.g. Josephson junctions
[Ghee82] or gallium arsenide [Long80]. The primary mode of operation in such a
system is sequential, although limited amounts of parallelism may be employed
in certain portions of the processor. The huge investment in existing software
fuels the effort to make this approach commercially viable.

The second approach to building high-performance computer systems
relies on a more general exploitation of parallelism, e.g. by using a large
pool of relatively inexpensive computers that operate in parallel to solve a
large problem which has been decomposed into a number of smaller subproblems.
Advances in integrated circuit technology have made this approach feasible by
allowing the construction of chips using a very large scale of integration
(VLSI) to pack hundreds of thousands of transistors onto a small piece of
silicon. It is in this latter approach that VLSI technology can have a truly
dramatic impact on the structure of tomorrow's computing systems. This thesis
will focus on the exploitation of parallelism to achieve high performance, and
in particular, on the hardware necessary to support high bandwidth
communications among thousands of processors.


A key design parameter of multicomputer systems (systems composed of
more than one processor interconnected by a communication network) is processor
granularity, i.e. the size and capability of the individual processing
elements. At one end of the spectrum, each processing element is very small and
limited in capability, allowing an entire multiprocessor system to be placed on
a single chip. Examples are the special purpose systolic array processors, which
are particularly suitable for high-throughput signal processing applications
[Kung80, Kung82], the 'tree-machine' developed at Caltech [Brow80], and the
Boolean Vector Machine proposed by Wagner [Wagn83]. Since the unit to be
replicated is small, often consisting of only an arithmetic unit and a few data
registers, the granularity of the system is very fine. The other extreme, using
very large granules, is exemplified by such supercomputers as the S-1, which
employs a few large, high-performance processors [Widd80]. Each processor
consists of thousands of integrated circuit chips. Commercially available
multiprocessor systems built by IBM [Ensl74a] or UNIVAC [Ensl74b] also belong
to this category.

Earlier work in the X-tree project [Desp78, Sequ78] advocated an
intermediate granule size equal to that of a single VLSI chip. For a general
purpose system, some minimum complexity is required in each processing element
to allow enough flexibility for several elements to cooperate productively
across a wide range of applications. The simple processor advocated by the
"small granule" approach is too small a building block for a general purpose
computer. On the other hand, a very large granule size forces closely coupled
components such as a processor and its associated memory to be implemented on
separate chips, thus increasing the performance penalties resulting from
off-chip communications. An intermediate granule size equivalent to a
single-chip microprocessor and its memory forms an entity with enough
processing power for general-purpose computing, but is still small enough to be
implemented on a single chip.

Advances in VLSI technology are making general-purpose computing systems
composed of thousands of processors economically feasible. The processors,
however, comprise only a portion of the system. The communication system that
interconnects the processors is of equal importance. The performance of many
multiprocessor systems has been limited by insufficient inter-processor
input/output (I/O) bandwidth. Furthermore, the communication system may
dominate the hardware cost. In Cm*, for example, the hardware responsible for
setting up the communication paths (i.e. the k-maps) was considerably more
expensive than that used in the processors [Swan77b]. It is clearly desirable to
also exploit VLSI technology to reduce the cost of the switching hardware. This
thesis discusses issues in the design of universal VLSI switching components to
be used as the building blocks for robust, high-bandwidth communication
networks with enough flexibility to serve a wide variety of multicomputer
configurations and applications.

1.1.  The  Concept:  Modular,  High-Bandwidth  Communication  Networks 

A  collection  of  VLSI  communication  components  that  can  be  combined  into 
networks  of  high  bandwidth  and  arbitrary  topology  is  envisioned.  Any  processor 
with  the  proper  interface  can  be  attached  to  this  communication  system.  Only  a 
few  types  of  VLSI  building  blocks  are  required,  providing  modularity  and  incre¬ 
mental  expansibility  (the  ability  to  create  a  larger  computing  system  by  adding 
hardware  to  an  existing  system).  The  goal  is  to  develop  components  that  plug 
together  easily  and  completely  hide  from  the  user  the  details  of  the  information 
transfer  within  the  network.  Just  as  the  telephone  system  hides  from  the  user 
the details of routing calls and transferring voice information, these new
communication modules handle the low-level details of transferring data by
providing circuitry to perform communication functions such as handshaking,
message routing, buffering, and flow control. For the system designer, the
lowest level primitive that must be dealt with is the information packet or the
block of data to be transmitted. For the user of the final system, the network
provides end-to-end communications much like the telephone system.

Figure  1.1  gives  a  conceptual  view  of  such  a  system,  divided  into  a  commun¬ 
ication  domain  (C)  using  these  VLSI  communication  components,  and  a  proces¬ 
sor  domain  (P)  dedicated  primarily  to  the  user’s  computations.  Required  pro¬ 
perties  of  the  communication  domain  include  unrestricted  network  topology, 
modularity,  incremental  expansibility,  decentralized  control,  and  the  ability  to 
recover  from  certain  classes  of  failures.  Low-latency,  high-bandwidth  communi¬ 
cations  are  required  to  achieve  good  performance  in  applications  such  as  circuit 
simulation and signal processing. This research will focus on networks using
dedicated links. The proposed communication domain is designed to support
high-performance communications among a large number, possibly thousands,
of processors.

Figure 1.1. Separation of a Multicomputer into a Communications
Domain (C) and a Processor Domain (P).

The proposed components perform several basic "store-and-forward"
communication functions. Each component receives messages from any attached
processor(s) and from other communication components. Before a message can
be  forwarded  to  the  next  component/processor,  an  output  link  must  be  selected 
via  some  routing  algorithm.  After  an  output  link  has  been  selected,  the  mes¬ 
sage  is  forwarded  over  this  link.  To  handle  conflicts  which  occur  when  more 
than  one  message  is  routed  over  the  same  output  link  at  the  same  time,  buffers 
are  provided  to  hold  waiting  messages.  Each  communication  component  must 
provide circuitry for managing these buffers. Finally, to avoid losing messages
when  the  buffer  space  in  a  component  is  exhausted,  a  flow  control  mechanism  is 
required  to  throttle  arriving  traffic.  Details  of  mechanisms  which  perform  these 
functions  are  discussed  in  chapter  4,  as  well  as  estimates  of  the  amount  of  circu¬ 
itry  required  to  implement  them. 
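
The functions listed above (routing, buffering, and flow control) can be
illustrated with a minimal Python sketch. It is not the design presented in
chapter 4. The class name `Component`, the dictionary used as a routing table,
and the single shared buffer pool of fixed capacity are all assumptions made
only for illustration.

```python
from collections import deque


class Component:
    """Toy store-and-forward switching component (illustrative assumptions only)."""

    def __init__(self, routes, buffer_capacity):
        self.routes = routes              # destination -> output link number
        self.capacity = buffer_capacity   # total buffer slots, assumed shared by all links
        self.queues = {}                  # output link -> deque of waiting messages

    def buffered(self):
        """Number of messages currently held in the component."""
        return sum(len(q) for q in self.queues.values())

    def can_accept(self):
        """Flow control signal: a sender may transmit only while this is True."""
        return self.buffered() < self.capacity

    def receive(self, destination, payload):
        """Route an arriving message and buffer it for the selected output link."""
        if not self.can_accept():
            return False                  # the sender is throttled; nothing is dropped
        link = self.routes[destination]   # routing decision
        self.queues.setdefault(link, deque()).append((destination, payload))
        return True

    def forward(self, link):
        """Send the oldest message waiting for `link`, or return None if none waits."""
        queue = self.queues.get(link)
        return queue.popleft() if queue else None
```

In this sketch a refused `receive` call stands in for the throttling of
arriving traffic: the sender keeps the message and retries once a buffer slot
has been freed by `forward`.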

The  types  of  processors  used  in  the  processor  domain  may  vary  depending 
on  the  application,  but  the  interface  to  the  communication  system  is  standard¬ 
ized.  This  separation  of  the  communication  domain  and  the  computation 
domain  relieves  the  processors  of  much  of  the  overhead  associated  with  the  for¬ 
warding  of  messages  destined  for  different  nodes.  It  makes  possible  the 
development  of  general  purpose  communication  hardware  that  is  suitable  for  a 
wide  range  of  applications,  and  also  provides  the  flexibility  to  construct  hetero¬ 
geneous  systems  containing  many  different  types  of  specialized  processors. 

One  may  note  the  similarity  between  the  components  described  here  and 
communication  processors  used  in  loosely  coupled  computer  networks.  An 
example of such a processor is the Interface Message Processor (IMP) used in
the ARPANET, a computer network linking several major universities and
institutions around the world [Kear70]. Indeed, many problems associated with loosely
coupled  computer  networks  (e.g.  routing,  buffering,  flow  control)  also  appear  in 
this  context.  However,  our  design  is  not  merely  a  scaled  down  version  of  the 
ARPANET.  The  key  differences  arise  from  the  aim  at  higher  bandwidth  and  lower 
latency,  intrinsically  lower  error  and  failure  rates  within  the  communication 
hardware,  and  the  envisioned  implementation  in  VLSI. 

1.2.  Definition  of  Terms 

A  number  of  terms  will  be  used  throughout  this  thesis.  In  order  to  avoid 
confusion,  their  meaning  in  the  context  of  this  report  will  now  be  defined. 

First,  each  message  sent  into  the  communication  domain  consists  of  some 
number  of  fixed  length  packets.  Communication  components  deal  exclusively 
with packets. Here, it is often the case that each message fits into a single
packet,  so  the  two  terms  will  be  used  interchangeably  when  no  confusion  arises. 

Each  communication  component  contains  some  number  of  ports  and  links. 
A  link  refers  to  the  collection  of  wires  connecting  a  communication  component 
to  another  such  component  or  to  a  computation  processor.  The  link  is  external 
to  the  chip.  A  port  is  the  circuitry  within  the  chip  which  drives  data  onto,  and 
receives  data  from  the  link.  When  necessary,  the  distinction  is  made  between  an 
input  port,  which  receives  data  entering  the  chip,  and  an  output  port,  which 
sends data away from the chip. Each link is bidirectional and full duplex, i.e.
each may simultaneously carry traffic in both directions. There is exactly one
link attached to each port, so when referring to the "number of ports/links",
the  two  terms  are  used  interchangeably. 

A  virtual  circuit  refers  to  an  end-to-end  connection  from  one  processor 
(say A) through a certain number of switching components to another processor
(B). Here, a virtual circuit (or circuit for short) refers to the directed path
through  the  network  from  A  to  B.  As  will  be  discussed  in  chapter  4,  virtual  cir¬ 
cuits  must  be  "established"  before  messages  can  be  sent,  and  all  data  sent  on 
the  same  circuit  follow  the  same  path.  In  order  to  distinguish  data  on  different 
virtual  circuits  which  are  using  the  same  physical  link,  each  link  is  divided  into 
some  number  of  virtual  channels,  with  each  channel  carrying  data  for  one  cir¬ 
cuit.  Thus,  a  virtual  circuit  is  a  sequence  of  channels  from  one  processor  to 
another. 
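
The relationship between circuits and channels can be sketched in Python as
follows. The table layout is an assumption made for illustration, not the
format used by the components: `tables[node]` maps an (input link, input
channel) pair to an (output link, output channel) pair, and `wiring` records
which input link of which neighbor each output link is attached to. Both names
are hypothetical.

```python
def trace_circuit(tables, wiring, node, in_link, channel):
    """Follow one virtual circuit hop by hop.

    tables[node][(in_link, channel)] -> (out_link, out_channel)
    wiring[(node, out_link)]         -> (next_node, next_in_link)
    An output link missing from `wiring` is taken to lead to the destination
    processor. Returns the list of (node, out_link, out_channel) hops.
    """
    path = []
    while True:
        out_link, out_channel = tables[node][(in_link, channel)]  # one table read per hop
        path.append((node, out_link, out_channel))
        if (node, out_link) not in wiring:
            return path
        node, in_link = wiring[(node, out_link)]
        channel = out_channel


# Processor A enters component X on link 0, channel 3; X link 1 is wired to
# Y link 0; Y link 2 leads to processor B.
tables = {"X": {(0, 3): (1, 5)}, "Y": {(0, 5): (2, 1)}}
wiring = {("X", 1): ("Y", 0)}
circuit = trace_circuit(tables, wiring, "X", 0, 3)
```

Here `circuit` is the sequence of channels that makes up the circuit from A to
B: channel 5 on the link leaving X, then channel 1 on the link leaving Y.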


In  the  discussion  presented  above,  virtual  circuits  were  defined  to  have  a 
single  source  and  destination  processor.  An  exception  to  this  is  a  multicast  cir¬ 
cuit  which  has  a  single  source,  but  more  than  one  destination.  A  message  sent 
on  a  multicast  circuit  is  replicated  within  the  network,  and  a  separate  copy  is 
received  by  each  destination  processor.  Such  a  mechanism  is  useful  in  applica¬ 
tions  requiring  the  same  data  to  be  distributed  to  several  other  processors,  as 
will  be  discussed  in  chapter  3. 

Another  term  used  extensively  in  this  thesis  is  virtual  cut-through 
[Kerm79]. This refers to a mechanism in which the forwarding of data packets
can  begin  as  soon  as  the  packet  header  (here,  the  first  byte)  arrives,  if  the 
proper  outgoing  link  is  idle.  Without  cut-through,  forwarding  would  have  to  be 
delayed  until  the  entire  packet  has  arrived  in  its  buffer.  It  will  be  seen  that  this 
immediate  forwarding  mechanism  can  lead  to  a  significant  improvement  in  per¬ 
formance. 
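
The size of the improvement can be estimated with a small Python sketch. It
rests on simplifying assumptions that are not taken from this report: every
link transfers one byte per clock cycle, switching overhead is zero, and every
outgoing link is idle when the header arrives. The function names are
hypothetical.

```python
def store_and_forward_cycles(packet_bytes, links):
    """Cycles to deliver a packet when every node waits for the whole packet."""
    return links * packet_bytes


def cut_through_cycles(packet_bytes, links, header_bytes=1):
    """Cycles to deliver a packet when forwarding starts once the header has arrived."""
    return (links - 1) * header_bytes + packet_bytes


# A 32-byte packet crossing 4 links (i.e. passing through 3 intermediate nodes).
without_cut_through = store_and_forward_cycles(32, 4)   # 128 cycles
with_cut_through = cut_through_cycles(32, 4)             # 35 cycles
```

Under these assumptions the delay without cut-through grows with the product
of packet length and path length, while with cut-through each intermediate
node adds only the header time. When an outgoing link is busy the packet must
still be buffered, so the benefit shrinks as the load on the network rises.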

Finally,  several  terms  are  used  regarding  the  performance  of  the  multicom¬ 
puter  and  the  communication  network.  Bandwidth  refers  to  the  amount  of 
traffic  a  communication  medium  can  carry  over  a  fixed  period  of  time,  typically 
measured  in  bits  per  second.  The  medium  may  be  a  single  communication  link 
or the entire network as a whole. Delay refers to the amount of time which
elapses from when a message/packet enters the communication medium, until
it  leaves.  A  more  precise  definition  will  be  given  later.  Latency  is  another  term 
which  is  used  interchangeably  with  delay.  Finally,  speedup  refers  to  the  ratio  of 
the  execution  time  of  an  application  program  on  a  single-processor  computer 
system to the corresponding time on a multicomputer system. Intuitively, it
indicates  how  much  faster  the  program  executes  on  the  multicomputer. 

1.3.  Previous  Work  in  Communication  Networks 

The  research  most  applicable  to  the  work  reported  here  may  be  broadly 
divided  into  two  categories:  loosely  coupled  computer  networks,  and  intercon¬ 
nection  networks  for  tightly  coupled  multiprocessors.  Each  of  these  will  now  be 
discussed  in  turn. 

1.3.1.  Loosely  Coupled  Computer  Networks 

A  great  deal  of  research  has  been  carried  out  in  loosely  coupled  communi¬ 
cation  networks.  Although  many  of  the  constraints  and  goals  in  the  design  of 
these  networks  are  different  from  those  discussed  here,  much  of  this  research  is 
still  applicable.  A  complete  overview  of  the  literature  in  this  field  is  beyond  the 
scope of the present discussion. Textbooks such as [Davi73, Tane81, Kuo81,
Ahuj82] provide excellent introductions to the field as well as extensive
bibliographies.  The  research  most  relevant  to  the  communication  component 
networks  discussed  here  deals  with  message  routing  techniques  and  protocols 
for  error  free  transmission.  Other  research  in  relevant  areas  (e.g.  deadlock 
prevention)  will  be  described  later  as  the  need  arises. 

Message routing is the process of selecting a route, i.e. a path, through a
network  from  a  processor  sending  a  message  to  the  processor  receiving  it. 
Research  in  this  area  is  usually  concerned  with  developing  general  techniques 
which are applicable to networks of arbitrary topology. An overview and
taxonomy of practical routing algorithms is given in [McQu74, Gerl81]. The
routing algorithms used by specific networks have also been described, e.g.
for the ARPANET [McQu74, McQu80], Datapac [Spro81], Tymnet [Tyme81,
Rind77], and IBM's SNA [Juen76]. Other heuristic routing schemes have been
proposed, among them [Roy82, Fran71, Chou81]. Finally, routing techniques
based  on  more  rigorous  mathematical  performance  models  include  [Cant74, 
Gall77,  Sega77].  Networks  constructed  from  communication  components  must 
also  use  some  routing  algorithm  to  establish  virtual  circuits,  so  much  of  the 
work  described  above  is  applicable  here. 

Another  significant  area  of  research  in  loosely  coupled  networks  is  in  the 
design  of  protocols  to  ensure  reliable  transmission  of  data  through  the  network. 
A good survey of work in this area and an extensive bibliography are given in
[Pouz78]. Much of the work in protocols has centered around the development of
a  layered  structure  of  communication  protocols,  and  defining  standard  proto¬ 
cols  within  each  layer.  As  a  result  of  this  work,  a  standard  has  been  defined  by 
the  International  Standards  Organization  (ISO),  and  is  now  widely  used  by  many 
computer manufacturers [Zimm80].

Of special interest here are protocols for flow control, i.e. mechanisms
which control the flow of traffic through the network. Good overviews of work in
this area are presented in [Pouz81, Pouz78, Kahn72]. Flow control procedures in
Datapac and Tymnet are described in [Spro81, Tyme81, Rind77], while a
hierarchical  flow  control  scheme  is  presented  and  analyzed  in  [Chu77]. 

Most  of  the  protocols  developed  for  loosely  coupled  networks  are  inap¬ 
propriate  for  the  networks  discussed  here.  This  is  because  these  protocols 
make  assumptions  which  are  not  valid  in  closely  coupled  multicomputer  net¬ 
works.  In  particular,  loosely  coupled  networks  cover  wide  geographic  areas  and 
are subject to adverse environmental conditions, so error rates can be expected
to be much higher than in networks constructed from communication
components. As a result, protocols in loosely coupled networks typically pay close
attention  to  detecting  and  retransmitting  corrupted  messages  at  all  levels  of 
the layered structure. With low error rates, however, transmission errors can be
handled by high-level (i.e. end-to-end) protocols, freeing lower level mechanisms
within  the  network  to  incorporate  such  performance  improving  techniques  as 
virtual  cut-through.  Thus,  the  protocols  used  in  loosely  coupled  networks  are 
normally  too  inefficient  for  the  networks  discussed  here. 

1.3.2.  Interconnection  Networks  for  Closely  Coupled  Multiprocessors 

A  great  deal  of  research  has  also  been  done  in  the  area  of  interconnection 
networks for closely coupled multiprocessor systems. "Classical" research in
interconnection networks examines single- and multistage interconnection
networks constructed from small (typically 2 by 2) crossbar switches. These
networks are discussed in the context of establishing processor to memory or
processor to processor communications. The bulk of the remaining research in the
field focuses on interconnection topologies. A good survey of work in both of
these areas is given in [Feng81].

The  work  in  single-  and  multiple-stage  interconnection  networks  can  be  par¬ 
titioned  into  two  categories:  networks  for  SIMD  (single-instruction  stream, 
multiple-data stream) computers, and networks for MIMD (multiple-instruction
stream, multiple-data stream) computers. A survey of interconnection networks
for SIMD computers is given in [Sieg79a]. Many of the SIMD networks are also
applicable to MIMD machines.

SIMD  systems  are  special  purpose  computers  typically  used  for  large  com¬ 
putational  tasks  requiring  many  vector  operations.  A  "typical"  SIMD  computer 
is  shown  in  figure  1.2.  Here,  a  number  of  processors  are  connected  to  memory 
modules through an interconnection network. The controller broadcasts
instructions to the various processors. All processors execute the same
instruction on each clock cycle. Each performs some computation using data from one
of  the  memory  modules.  If  data  (e.g.  elements  of  a  vector)  are  properly  distri¬ 
buted  across  the  memory  modules,  then  conflicts  in  accessing  the  memories 
can  be  avoided. 

Figure 1.2. A typical SIMD Machine.

In the scenario described above, the interconnection network effectively
aligns data scattered across the memory modules. Alternatively, the network
can be thought of as performing some permutation of input lines to output lines.
Thus, these networks are sometimes referred to as alignment or permutation
networks. Networks which support any permutation of input to output lines are
called "nonblocking". The crossbar switch [Wulf72] and the Clos network
[Clos53]  are  examples  of  nonblocking  networks.  Nonblocking  networks  become 
prohibitively  expensive  as  the  number  of  processors  and  memory  modules 
grows, so less expensive networks which support some subset of all possible
permutations (called "blocking" networks) have been explored. Examples of
blocking permutation networks include the shuffle exchange [Ston71], banyan
networks [Goke73], the omega network [Lawr75], the flip network used in the
STARAN processor [Batc76], the indirect binary n-cube [Peas77], the baseline
[Wu80a], and the reverse-exchange network [Wu80b]. An introduction and
overview of this work is presented in [Chen81]. Classes of networks which subsume
many of the specific networks listed above have also been discovered, e.g. the
delta network class [Pate81] and the multistage cube [Sieg81]. Thus it is not
surprising that many of the variations described above have identical, or only
slightly different, performance characteristics.
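
The notion of a blocking network can be made concrete with a short Python
sketch of an omega network. It assumes the usual textbook formulation, which is
not spelled out in this report: each stage is a perfect shuffle of the lines
followed by a column of 2 by 2 switches, and each switch sends a message to its
upper or lower output according to one bit of the destination address, most
significant bit first. The function name is hypothetical.

```python
def omega_passes(permutation):
    """Return True if an omega network can realize `permutation` in one pass.

    permutation[s] is the output line requested by input line s. The number of
    lines must be a power of two.
    """
    size = len(permutation)
    stages = size.bit_length() - 1
    assert size > 0 and 1 << stages == size, "number of lines must be a power of two"
    positions = list(range(size))
    for stage in range(stages):
        bit_index = stages - 1 - stage
        next_positions = []
        for source, position in enumerate(positions):
            dest_bit = (permutation[source] >> bit_index) & 1
            # Perfect shuffle (rotate the line number left by one bit), then the
            # switch replaces the low bit with the current destination bit.
            next_positions.append(((position << 1) & (size - 1)) | dest_bit)
        if len(set(next_positions)) < size:   # two messages need the same line
            return False
        positions = next_positions
    return True


identity_ok = omega_passes([0, 1, 2, 3, 4, 5, 6, 7])
bit_reversal_ok = omega_passes([0, 4, 2, 6, 1, 5, 3, 7])
```

With eight lines the identity permutation passes, while the bit-reversal
permutation does not: inputs 0 and 4 both require the same output line of the
first stage, which is exactly the kind of conflict a nonblocking network
avoids.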

Extensive  analyses  and  comparisons  of  different  permutation  networks  have 
been  performed.  For  example,  in  [Sieg79b]  bounds  are  derived  for  the  time 
required  for  some  networks  to  simulate  others.  Parker  shows  that  the  inverse 
omega network and the indirect binary n-cube have identical switching
characteristics [Park80], while in [Wu80a] it is shown that the flip network,
omega network, indirect binary n-cube, and one form of the banyan network are
topologically isomorphic. Equivalence classes among permutation networks are
defined in [Prad80]. Other analyses describing performance and permutation
properties include [Fran81, Nass81, Than81]. Extensions which allow the set of
performable permutations to be expanded, typically by cascading more than one
network or allowing multiple iterations through the same network, are discussed
in [Yew81, Wu81a]. A theory for composing the permutations performed by the
omega network is discussed in [Stei83]. Finally, parallel algorithms for setting
up the switches in permutation networks are described in [Lev81, Nass82]. One
disadvantage of the permutation networks described above is that the time
complexity to set up an n input network for a given permutation is O(n log n)
with the fastest known serial algorithms. The algorithms in these papers allow
settings to be determined in as little as O((log n)^2) time in some situations
when n processors are used to perform the computation.

The topologies of the networks described above can also be applied to
networks for MIMD machines. Here, however, average message delay and network
bandwidth are used as performance measures rather than the number of
permutations performed. In this context, it has been shown that most of the
networks described above yield virtually the same performance [Pate81].

These  interconnection  networks  represent  one  class  of  networks  which 
could  be  implemented  with  the  communication  components  described  here. 
Special  switches  designed  specifically  for  these  permutation  networks  (typically, 
2  by  2  crossbar  switches)  have  two  apparent  advantages  over  the  general  pur¬ 
pose  components  proposed  in  this  thesis.  First,  since  the  network  topology  is 
fixed, they may be optimized for efficient message routing. However, with a
virtual circuit transport mechanism (described in chapter 4), message routing is
reduced  to  a  single  read  from  a  relatively  small  (a  few  hundred  entries  at  most) 
table.  In  current  technology,  this  can  be  performed  in  a  single  clock  cycle, 
where  the  clock  cycle  is  determined  by  the  rate  at  which  data  can  be  clocked 
into  a  chip,  so  any  advantage  derived  from  optimized  routing  is  minimized. 

Second,  current  implementations  of  the  simple  2  by  2  switches  require  less 
circuitry than the components described here. However, this difference is
largely due to the improved functionality of our communication component,
rather than some fundamental increase in complexity. The components
described here use more sophisticated buffer management strategies than are
typically used in the 2 by 2 switches, and a microcoded engine is provided for
implementing  failure  recovery  protocols.  Since  the  performance  of  a  switching 
node  is  limited  by  I/O  bandwidth  (i.e.  there  is  some  maximum  number  of  pins  on 
each  chip  and  some  maximum  rate  at  which  each  pin  can  be  driven)  and  since 
off-chip  communications  are  typically  an  order  of  magnitude  slower  than  on-chip 
speeds  [Sequ78],  this  additional  complexity  is  not  detrimental  to  the  clock  rate. 
In  addition,  general  purpose  components  provide  enough  flexibility  to  allow  net¬ 
works  to  be  tailored  to  the  communication  needs  of  the  system.  For  example, 
more  bandwidth  could  be  placed  near  expected  points  of  congestion,  e.g.  around 
the  disks.  Thus,  the  communication  components  described  here  can  achieve  at 
least  as  much  performance  as  the  switching  nodes  in  the  permutation  networks, 
if  not  more,  as  well  as  provide  additional  flexibility  to  the  system  designer. 

Other  research  in  interconnection  networks  focuses  on  defining  attractive 
network  topologies.  This  work  can  be  classified  into  two  categories:  networks  for 
special  purpose  computation,  and  networks  for  general  purpose  computation. 
Special  purpose  network  topologies  are  aimed  toward  achieving  an  efficient 
mapping  of  some  class  of  algorithms  onto  the  network.  General  purpose  net¬ 
works  cannot  assume  any  specific  algorithm,  so  they  try  to  optimize  some  gen¬ 
eral  criteria  for  goodness,  e.g.  average  hop  count  between  pairs  of  nodes. 
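
As an illustration of such a criterion, the following Python sketch computes
the average hop count of a topology by breadth-first search from every node.
The adjacency-dictionary representation and the two example topologies (a ring
and a binary hypercube) are choices made here for illustration, and the sketch
assumes a connected network of at least two nodes.

```python
from collections import deque


def average_hop_count(adjacency):
    """Mean shortest-path length over all ordered pairs of distinct nodes."""
    total = 0
    pairs = 0
    for source in adjacency:
        distance = {source: 0}
        frontier = deque([source])
        while frontier:                          # breadth-first search from `source`
            node = frontier.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    frontier.append(neighbor)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs


def ring(n):
    """Ring of n nodes, each linked to its two neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def hypercube(dimension):
    """Binary hypercube: nodes are linked when their numbers differ in one bit."""
    return {i: [i ^ (1 << b) for b in range(dimension)] for i in range(1 << dimension)}


ring_hops = average_hop_count(ring(8))        # 16/7, about 2.29
cube_hops = average_hop_count(hypercube(3))   # 12/7, about 1.71
```

For eight nodes the hypercube has the lower average hop count, but each of its
nodes needs three links instead of two, which is the kind of tradeoff between
the number of links and the path length that a general criterion has to weigh.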

An  introduction  to  research  in  special  purpose  networks  designed  for 
efficient execution of specific algorithms is presented in [Gott82]. In [Thom80] a
theory of VLSI is introduced and bounds for area/time tradeoffs in implementing
VLSI chips for specific computations (e.g. the FFT) are derived. Also, the work in
systolic architectures examines two dimensional networks suitable for executing
certain numerical algorithms for signal processing and matrix manipulation
problems [Kung80, Kung82]. There has also been an extensive amount of work in
matching problems to such well known topologies as the perfect shuffle
[Ston71], the mesh [Nass79, Prep83], and tree networks [Deke83, Nath83]. The
cube-connected-cycles network is another network which exhibits properties
favorable for the efficient implementation of certain parallel algorithms
[Prep81].

Much of the work in topologies for general purpose computation focuses on
defining networks which achieve some characteristic expected to lead to good
performance (e.g. small average hop count). One problem in this domain which
has received some attention is the "(d,k) graph problem", in which the goal is to
maximize the number of nodes in a graph of degree d and diameter k [Acke65,
Frie66, Korn67, Stor70, Toue79, Imas81, Memm82, Amar83]. Other topologies
recently proposed for communication networks include ringed trees [Desp78],
snowflakes [Fink80], clusters [Wu81b], chordal rings [Arde81], C$' graphs
[Farh81], binary trees [Horo81], cube connected cycles [Prep81], hypertrees
[Good81], lens networks [Fink81], multiple tree structures [Arde82], and mobius
graphs [Lela82a, Lela82b]. Comparisons of some of these structures are
reported in [Witt81, Swar82, Reed83]. Finally, other research examines
topologies which are attractive for fault tolerance, e.g. [Prad82, Adam82].
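
For the (d,k) graph problem mentioned above, a simple upper limit on the number
of nodes is the Moore bound, a standard result from graph theory that is not
stated in this report: a node has at most d neighbors, each of which can reach
at most d - 1 further nodes at every additional hop. A short Python sketch,
with a hypothetical function name:

```python
def moore_bound(degree, diameter):
    """Upper bound on the number of nodes in a graph of the given degree and diameter."""
    nodes = 1                    # the starting node
    reachable = degree           # nodes at distance exactly 1
    for _ in range(diameter):
        nodes += reachable
        reachable *= degree - 1  # each new node contributes at most degree - 1 more
    return nodes


petersen_limit = moore_bound(3, 2)   # 10
ring_limit = moore_bound(2, 3)       # 7
```

The bound of 10 for degree 3 and diameter 2 is attained by the Petersen graph,
and the bound of 2k + 1 for degree 2 is attained by a ring. For most other
values of d and k no graph reaches the bound, which is why the problem is posed
as a search for the largest attainable graph.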

Most  of  this  work  is  directly  applicable  to  the  networks  studied  here,  since 
it  refers  to  topologies  which  can  be  constructed  from  the  proposed  components. 
These earlier studies are at a higher level of abstraction than those presented
here, however. While the work reported above focuses entirely on system
performance, the work described here is aimed at low level design decisions, e.g.
the number of I/O ports on each chip, and considers the constraints imposed by a
VLSI implementation. The impact of these constraints on overall performance is
examined. Some work in implementation issues has been performed by Franklin;
however, this has been restricted to studies of partitioning certain switching
structures, e.g. crossbar switches and banyan networks, into modules suitable
for VLSI implementation [Fran82].

In  the  area  of  communication  components,  some  building  block  modules 
have  been  proposed.  In  [Hopp79]  a  packet  switched  2  by  2  crossbar  node  using 
unidirectional  links  is  proposed  as  a  switching  node.  Routing  information  is  car¬ 
ried  with  each  packet  as  a  sequence  of  bits,  with  each  bit  indicating  the  direc¬ 
tion  the  packet  is  to  follow  at  intermediate  nodes.  One  disadvantage  of  this 
scheme  is  that  the  destination  address  of  each  node  varies  according  to  the 
location  of  the  sender  of  the  message,  and  senders  are  required  to  generate  this 
routing  information  themselves.  With  arbitrary  networks,  this  computation  is 
somewhat complex, and recomputations are necessary if the topology changes
because of component failures or network expansion. Simple flow control and
buffering schemes are provided, although they do not prevent some of the buffer
hogging and deadlock problems discussed in chapter 4. Unlike the design
presented here, no processor is provided in each switching node. Overall, this
component can be regarded as similar in intent to the components described
here, but much less sophisticated in functionality.
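
The drawback of sender-computed routing noted above can be seen in a small
Python sketch. It does not reproduce the packet format of [Hopp79]. It only
assumes that every node has two outputs and that the sender supplies one bit
per hop; the four-node example network and the function names are
hypothetical.

```python
from collections import deque


def source_route(outputs, sender, destination):
    """Return the routing bits for a shortest path from sender to destination.

    outputs[node] is a pair (node reached via output 0, node reached via output 1).
    Returns None if the destination cannot be reached.
    """
    routes = {sender: []}
    frontier = deque([sender])
    while frontier:
        node = frontier.popleft()
        if node == destination:
            return routes[node]
        for bit, neighbor in enumerate(outputs[node]):
            if neighbor not in routes:
                routes[neighbor] = routes[node] + [bit]
                frontier.append(neighbor)
    return None


def follow(outputs, sender, bits):
    """Deliver a packet by consuming one routing bit at each node."""
    node = sender
    for bit in bits:
        node = outputs[node][bit]
    return node


outputs = {"a": ("b", "c"), "b": ("c", "d"), "c": ("d", "a"), "d": ("a", "b")}
route_from_a = source_route(outputs, "a", "d")   # [0, 1]
route_from_b = source_route(outputs, "b", "d")   # [1]
```

The same destination `d` is addressed as `[0, 1]` by sender `a` and as `[1]` by
sender `b`, and every sender has to search the topology again whenever a link
or node is added or fails.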

A  component  similar  to  that  described  above  is  the  Dual  Interconnecting 
Modular Network Device, or DIMOND [Jans80]. Again, this component has two
input and two output links, and each message carries detailed routing
information with it. In [Jans80] details of the implementation of the DIMOND are
explained,  as  well  as  its  use  in  constructing  networks  such  as  rings  and  trees.  A 
minimal  amount  of  buffering  is  provided  in  each  component  (a  single  register  on 
each  output  port). 

Finally,  a  3  input,  3  output  link  component  called  STICS  (Synchronous  Tri¬ 
angular  Interprocessor  Connection  Scheme)  has  also  been  proposed  [Rile82]. 
These  components  can  only  be  applied  to  a  very  restricted  class  of  topologies 
however, and thus are not as general as the components described here.

To the author’s knowledge, all of the previous work in VLSI communication
components has emphasized simplicity at the expense of generality,
functionality, and/or performance. With advances in VLSI, however, chip densities are
increasing  at  a  rapid  rate,  and  more  functionality  can  readily  be  integrated  onto 
a  single  chip.  Thus,  more  sophisticated  designs  achieving  greater  functionality 
and  performance  are  becoming  practical.  The  communication  components 
described  here  represent  one  attempt  to  design  and  analyze  the  performance  of 
such  a  switching  chip. 

The  work  presented  here  is  a  continuation  and  extension  of  the  work  in  the 
communication switch for the X-tree project [Sequ78, Desp78]. Work on the
low-level design of the internal structure of an X-tree node is described in [Laur79,
Grif79, Fuji80, Wong81]. Perspectives and lessons learned from these designs and
the X-tree project as a whole are described in [Sequ82]. While the work in X-tree
focussed  on  a  particular  topology,  the  components  proposed  here  provide  more 
flexibility,  allowing  construction  of  arbitrary  high-performance  communication 
networks. 

1.4.  Overview  of  Thesis 

This  thesis  focuses  on  the  design  of  VLSI  communication  components,  and 
the  impact  of  certain  design  decisions  on  system  performance.  The  remainder 
of  this  thesis  is  organized  as  follows:  In  chapter  2,  the  tradeoff  between  the 
number  and  bandwidth  of  the  communication  links  is  discussed  in  the  context  of 
a  single-chip  implementation  of  the  proposed  communication  component.  It  is 
seen that the I/O bandwidth of each component is fixed, and assumed to be
equally  divided  among  the  attached  links.  The  communication  network  is 
modeled  as  a  set  of  nodes  (components)  interconnected  by  communication 
links.  The  question  of  whether  each  node  should  have  a  large  number  of  low 
bandwidth links (implying relatively few "hops" between a given pair of nodes)
or a small number of high bandwidth links (implying many hops) is addressed.
Each  node  of  a  topology  requiring  b  "branches"  or  links  to  neighboring  nodes 
can  be  implemented  by  a  cluster  of  p-port  communication  components.  M/M/1 
queueing models are used to determine the optimal value of p, using average delay and
total bandwidth of the "cluster node" as performance measures. It is found that
components with a small number of ports yield cluster nodes with the most
bandwidth and the lowest average delay.
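
The tradeoff between the number of ports and the bandwidth of each link can be
sketched in Python with the standard M/M/1 result that the mean time a packet
spends at a queue is 1/(mu - lambda), where mu is the service rate and lambda
the arrival rate. This is not the model developed in chapter 2. It assumes
that a fixed chip bandwidth is divided equally among the ports, that every
link on a path behaves as an independent M/M/1 queue with the same arrival
rate, and all function and parameter names are hypothetical.

```python
def mm1_delay(service_rate, arrival_rate):
    """Mean time in an M/M/1 queue (waiting plus service), in seconds."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


def path_delay(chip_bandwidth, ports, packet_bits, arrival_rate, hops):
    """Mean delay over `hops` links when the chip bandwidth is split among `ports`."""
    link_rate = chip_bandwidth / ports / packet_bits   # packets per second per link
    return hops * mm1_delay(link_rate, arrival_rate)


# Same chip bandwidth (8000 bits/s), same load (10 packets/s per link):
few_fast_links = path_delay(8000, ports=4, packet_bits=100, arrival_rate=10, hops=3)
many_slow_links = path_delay(8000, ports=8, packet_bits=100, arrival_rate=5, hops=2)
```

Doubling the number of ports halves the rate of every link, so the saving in
hops has to outweigh the slower service at each hop. In this made-up example
the four-port component gives 0.3 s over three hops, while the eight-port
component gives 0.4 s over two hops even with half the load per link.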

Cluster  nodes  using  components  with  a  small  number  of  ports  require  more 
chips than those using a larger number of ports. Thus, the cluster node studies
do  not  consider  differing  chip  counts.  Networks  with  the  same  number  of  com¬ 
ponents  are  compared  within  certain  classes  of  network  topologies  (e.g.  lattices 
and  trees).  It  is  found  that  components  using  a  small  number  of  ports,  e.g.  from 
3  to  5,  tend  to  yield  networks  with  lower  average  delay,  but  less  bandwidth  than 
networks  using  components  with  a  larger  number  of  ports.  It  is  argued  however, 
that  while  network  bandwidth  can  be  increased  by  using  more  communication 
chips,  average  delay  cannot  be  reduced  so  easily.  Thus,  components  with  a 
small  number  of  ports  should  be  used. 

In chapter 3, results of simulation studies are presented. Here, parallel
application  programs  are  used  to  create  traffic  loads  for  networks  constructed 
from  communication  components.  The  traffic  loads  cover  a  wide  variety  of 
different  communication  patterns.  Both  cluster  node  networks  and  networks 
using approximately the same number of components are examined. The
simulation results support the conclusion of the previous chapter that a small
number of ports should be used.

Chapter  4  examines  the  design  of  a  communication  component  in  greater 
detail, and discusses the complexity of the required circuitry. Various
mechanisms for transporting data through any communication network are
discussed, and a mechanism based on virtual circuits is argued to be the most
appropriate for the networks discussed here. Schemes for providing hardware
support for communication functions (routing, buffer management, and flow
control) are presented, and estimates of the number of buffers and virtual
channels are
determined.  Based  on  these  estimates,  the  amount  of  circuitry  to  implement  a 
communication  component  is  estimated,  and  a  floorplan  for  one  implementation 
is  shown. 

Finally,  chapter  5  presents  concluding  remarks,  and  a  summary  of  design 
recommendations  for  implementing  general  purpose,  high-performance  VLSI 


munication  components  is  evaluated.  The  optimal  number  of  communication 
ports  for  each  chip  is  discussed  in  detail.  The  performance  improvement  result¬ 
ing  from  incorporating  a  virtual  cut-through  mechanism  into  the  communication 
hardware  is  also  studied. 

The  first  section  discusses  constraints  imposed  by  a  single-chip  implemen¬ 
tation  of  the  communication  components.  These  constraints  lead  to  a  tradeoff 
between  the  number  and  bandwidth  of  the  communication  links  (or  ports  since 
there  is  one  link  per  port).  The  following  section  discusses  analytical  studies 
evaluating  the  performance  of  various  networks  constructed  with  VLSI  communi¬ 
cation  components  as  a  function  of  the  number,  and  thus  of  the  bandwidth  of 
the  communication  links. 


2.1.  VLSI  Constraints 

A  VLSI  chip  is  subject  to  a  number  of  technological  constraints.  Violation  of 
these  constraints  will  result  in  a  chip  which  cannot  be  manufactured  in  large 
quantities,  or  which  cannot  be  depended  upon  for  reliable  operation.  For  this 
study,  the  three  most  important  constraints  are: 

(1)  Limited amount of silicon area.


bandwidth  of  the  communication  links. 


2.1.1.  Area 

Beyond a certain die size, the yield, i.e. the fraction of manufactured chips
which function correctly, decreases dramatically with increased area [Glas78].
Current technology allows approximately 500,000 transistors to be placed on a
single chip. It is projected that chips with 1,000,000 devices will be possible by
1985  [Patt80].  It  will  be  demonstrated  in  chapter  4  that  this  is  more  than  ade¬ 
quate  for  the  communication  components  described  here,  so  limited  amounts  of 
silicon  area  do  not  severely  constrain  the  design  of  the  chip. 

2.1.2.  Power 

The  total  amount  of  power  generated  by  the  chip  must  not  exceed  some 
upper  bound  determined  by  the  power  dissipation  capacity  of  the  integrated  cir¬ 
cuit  package.  Since  the  average  power  dissipation  determines  the  amount  of 
heat generated by the chip, violation of this constraint will result in a
component which will overheat and fail during operation. We will assume that the
amount of power dissipated by the chip varies linearly with the number of links,
i.e. the total amount of power consumed by a p-port component is
$(P_p \times p) + C_l$, where $P_p$ is the average amount of power consumed by
each port, and $C_l$ is the
power  dissipated  by  circuitry  which  is  not  affected  by  the  number  of  ports  (e.g. 
portions  of  the  control  and  routing  circuitry).  The  power  dissipated  by  this  "link 
independent"  circuitry  is  assumed  to  be  constant;  it  thus  reduces  the  total 
amount  of  power  the  port  circuitry  can  dissipate,  but  does  not  enter  into  the 
tradeoff  to  be  discussed. 


If,  for  the  moment,  we  neglect  static  power  dissipation,  then  the  power  dis¬ 
sipated  by  the  link  circuitry  is  proportional  to  the  clock  rate  [Carr72].  Doubling 
the number of links doubles the amount of circuitry, and thus the power
dissipated by the chip. This can be offset by halving the clock rate, which in turn,
halves  the  bandwidth  of  each  communication  link.  Thus,  increasing  the  number 
of  links  requires  a  proportional  decrease  in  the  bandwidth  of  each  one. 

Let  us  now  consider  static  power  dissipation.  If  it  is  assumed  that  the  static 
power  dissipation  of  each  transistor  remains  constant  as  the  clock  rate  is 
varied,  then  increasing  the  number  of  ports  increases  the  number  of  transistors, 
which  in  turn  increases  both  the  static  and  dynamic  power  dissipation  of  the 
chip.  However,  reducing  clock  speed  only  reduces  dynamic  dissipation.  Thus, 
increasing  the  number  of  ports  really  implies  a  more  than  proportional  decrease 
in  link  speed.  Therefore,  the  linearity  assumption  is  biased  to  favor  a  large 
number  of  ports. 

On  the  other  hand,  a  slower  clock  rate  implies  that  smaller  transistors  may 
be  used,  resulting  in  a  reduction  in  static,  as  well  as  dynamic,  power  dissipation. 
If  the  clock  rate  is  cut  in  half,  the  current  driving  capabilities  of  (say)  an  NMOS 
transistor  may  also  be  cut  in  half,  which  in  turn  halves  the  static  power  dissipa¬ 
tion.  In  other  words,  both  static  and  dynamic  power  dissipation  are  proportional 
to  the  clock  rate.  This  is  in  agreement  with  the  original  model  which  only  con¬ 
sidered  dynamic  power  dissipation,  so  link  bandwidth  is  again  a  linear  function 
of  the  number  of  links. 

Therefore, when power dissipation is considered, the linearity assumption
can only be biased to favor a large number of ports. In the analysis which
follows, it will be seen that a small number of ports yields better performance
under the linearity assumption. A more complex model which accounts for the
bias will only add further support for this conclusion. Here, it will be assumed
that  link  bandwidth  is  inversely  proportional  to  the  number  of  communication 
links  if  power  restrictions  constrain  the  design  of  the  chip. 

2.1.3.  Pins

The number of interconnections to the chip's periphery is limited, and will
increase much more slowly than the number of transistors per chip [Keye79].
Given N pins for p communication links, there are N/p pins per link.
Bandwidth per link is thus proportional to N/p, assuming a constant bandwidth
for each pin. Doubling the number of links halves the number of pins, and thus
the total bandwidth, of each link. Thus, due to pin limitations, bandwidth per
link also varies inversely with the number of links.

In  the  analysis  presented  above,  it  was  assumed  that  all  of  the  pins  of  each 
link  are  used  for  transmitting  data.  In  a  real  implementation,  some  of  the 
external  connections  may  be  used  for  control  lines.  These  control  lines 
represent an overhead which increases with the number of links. Doubling the
number  of  links  doubles  the  number  of  control  lines,  implying  fewer  pins  are 
available  for  transmitting  data.  This  results  in  a  more  than  proportional 
decrease  in  link  speed.  A  more  precise  model  which  includes  control  pins  will 
lead  to  better  performance  for  networks  with  a  small  number  of  ports,  since  the 
simplified  model  described  above  does  not  include  this  "per  link”  overhead. 
Again,  the  more  precise  model  strengthens  the  conclusions  which  follow. 

Finally,  this  model  neglects  the  effects  of  data  skew.  In  a  traditional  imple¬ 
mentation  of  a  parallel  communication  link,  the  receiver  must  wait  for  all  of  the 
arriving  bits  to  reach  a  stable  value  before  clocking  the  data.  Due  to  possible 
variations  in  propagation  delay  along  the  different  wires  of  the  link,  a  parallel 
link  must  usually  operate  at  a  slower  clock  rate  than  the  corresponding  serial 
link,  an  effect  not  accounted  for  in  the  analysis  presented  above.  These  data 
skew  problems  can  be  alleviated  by  implementing  the  parallel  link  as  a  number 
of  autonomous  serial  links,  allowing  the  link  to  operate  at  the  highest  possible 
clock rate. This latter implementation leads to link speeds which are
proportional to the number of pins per link, in accordance with the linear model
presented above.

2.1.4.  Summary  of  Constraints 

A  tradeoff  exists  between  the  number  of  links  per  chip  and  the  speed  of 
each  link.  If  the  chip  design  is  constrained  by  either  power  or  pin  limitations, 
then  doubling  the  number  of  links  either  halves  the  clock  rate  or  halves  the 
number  of  pins  allocated  to  each  one.  In  either  case,  the  link  bandwidth  is 
halved. In effect, each chip has some total amount of I/O bandwidth which is
equally  divided  among  the  existing  communication  links.  This  "constant 
bandwidth  per  chip"  model  will  be  used  in  all  of  the  studies  which  follow. 
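
The "constant bandwidth per chip" model can be illustrated with a short sketch. The function name below is hypothetical, and the 100 Mbit/s total is only the arbitrary value the text adopts later for its numerical computations; the sketch simply divides a fixed I/O budget evenly among the links.

```python
def link_bandwidth(total_io_bandwidth_bps: float, ports: int) -> float:
    """Bandwidth of each link when a chip's fixed I/O bandwidth is split evenly."""
    if ports < 1:
        raise ValueError("a component needs at least one port")
    return total_io_bandwidth_bps / ports


if __name__ == "__main__":
    B = 100e6  # assumed total I/O bandwidth per chip, in bits per second
    for p in (3, 4, 5, 8, 16):
        print(f"p = {p:2d}: {link_bandwidth(B, p) / 1e6:6.2f} Mbit/s per link")
```

Under this model, doubling the number of ports halves the bandwidth of every link, which is the tradeoff examined in the rest of the chapter.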

In  addition  to  its  effect  on  link  speed,  the  number  of  ports  also  affects  the 
average  hop  count  between  two  nodes  in  the  network  (e.g.  a  ternary  tree  could 
be used instead of a binary tree if one more port were available for each node).
As  the  number  of  links  on  each  chip  is  increased,  the  average  hop  count  between 
pairs  of  nodes  is  reduced.  The  sections  which  follow  present  analytical  and 
simulation  results  exploring  this  tradeoff  between  link  speed  and  hop  count. 

2.2.  Analytical  Studies 

In this section, the performance of networks constructed from p-port
communication components is evaluated through analytical models. Average
"end-to-end" delay and total network bandwidth are used as performance measures.
The delay from point A to point B in a network is defined as the time which
elapses from when the packet header begins to leave A to when the entire
packet arrives at B. The "hop count" from A to B is defined as the length, i.e.
the number of links, of the minimum length path from A to B. Network
bandwidth  is  the  amount  of  traffic  the  network  can  carry  over  some  fixed  time 
interval.  A  more  precise  definition  for  bandwidth  will  be  given  later. 


In  a  real  network  carrying  traffic  generated  by  a  parallel  application  pro¬ 
gram,  average  message  delay  may  not  be  an  appropriate  performance  measure. 
If  a  data  value  is  generated  in  one  processor  long  before  it  is  used  by  another, 
then  delays  encountered  by  the  message  carrying  this  data  do  not  affect  the 
execution  time  of  the  program,  so  long  as  the  data  arrives  before  it  is  needed. 
However,  since  we  cannot  know  a  priori  which  message  delays  affect  perfor¬ 
mance,  average  delays  will  be  used.  Also,  averages  are  simpler  to  compute  than 
other  measures,  e.g.  maximum  delay.  A  more  detailed  simulation  study  will  be 
discussed  in  chapter  3  which  uses  execution  time  (actually,  speedup)  as  the  per¬ 
formance  measure. 

In order to evaluate the tradeoff between hop count and link bandwidth, two
multicomputer network models are developed. In the first model, the
implementation of a topology requiring b "branches" per node with p-port
communication components (p < b) is considered. To achieve the necessary fanout,
several components are interconnected to form a "cluster node" with b external
branches. Each cluster node forms a single conceptual node of the desired
topology. Delay and bandwidth are compared for various values of p. In general,
a cluster node using components with a small number of ports will require
more components than one using a larger number of ports. Thus, comparisons
under this model neglect chip count.

A second analysis compares networks using the same number of components.
In this model, the hop count/link bandwidth tradeoff is evaluated
within individual classes of network topologies, such as trees or lattices.

In  each  case,  a  queueing  model  is  used  to  evaluate  network  performance. 
The  assumptions  made  by  this  model  are  outlined  in  the  next  section.  Perfor¬ 
mance  with  and  without  a  virtual  cut-through  mechanism  is  explored.  Delay  in  a 
lightly loaded network and overall network bandwidth are computed and
compared for the different approaches.

2.2.1.  Assumptions 

As  discussed  earlier,  it  is  assumed  that  the  bandwidth  of  each  communica¬ 
tion  link  is  a  linear  function  of  the  number  of  links  on  each  chip.  In  the  analysis 
which  follows,  a  queueing  model  is  used,  and  a  number  of  other  assumptions 
must  be  made: 

(1)  Message arrivals at different nodes are independent.

(2)  Message  arrival  times  have  a  Poisson  distribution. 

(3)  Message  lengths  have  an  exponential  distribution. 

(4)  Each  node  contains  unlimited  buffer  space. 

(5)  Routing  through  each  node  is  deterministic. 

(6)  Electrical  propagation  delays  are  negligible. 

(7)  Transmission  error  rates  are  negligible. 

The  first  three  assumptions  are  necessary  to  solve  the  queueing  model.  In 
particular,  the  first  assumption,  often  referred  to  as  the  "independence 
assumption",  states  that  "the  exponential  distribution  [for  message  length]  is 
used  in  generating  a  new  length  each  time  a  message  is  received  by  a  node  ..." 
[Klei76].  This  is  clearly  false  since  messages  maintain  their  length  as  they  pass 
through  the  network,  but  the  effect  of  the  assumption  on  the  accuracy  of  mes¬ 
sage  delay  computations  is  negligible  so  long  as  the  network  does  not  contain 
long  chains  with  no  interfering  traffic  [Kerm79].  The  assumption  is  a  reasonable 
one for the networks examined here because the traffic loads used in these
studies lead to output links which carry traffic arriving from several different input
links,  eliminating  the  long  chains  described  above. 

Similarly, the Poisson arrival time and the exponentially distributed
message length assumptions (the latter implies exponential service times) allow
the use of M/M/1 queues which can be easily solved. Relaxing each of these
assumptions results in G/M/1 and M/G/1 queues respectively. If these queues are used
however,  Jackson's  theorem  [Jack57]  cannot  be  applied,  since  the  arrival  times 
at  each  node  no  longer  follow  a  Poisson  distribution.  The  resulting  queueing 
models  are  difficult  to  solve  for  the  large,  complex  networks  studied  here. 
These  assumptions  are  simplifications  since  traffic  in  the  actual  network  need 
not  be  Poisson,  and  the  networks  considered  here  use  fixed  length  packets,  as 
will  be  discussed  in  chapter  4.  Simulation  studies  will  be  discussed  later  which 
remove  these  restrictions.  Further,  a  second  approximate  queueing  model 
using M/G/1 queues will also be discussed [Klei78]. Here, the approximating
assumption  that  Jackson's  theorem  still  applies  is  made.  It  will  be  seen  that 
although  this  second  approximate  model  yields  performance  curves  somewhat 
different  from  the  first,  the  final  conclusions  drawn  from  the  two  models  are 
identical. 
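
To give a feel for how much the exponential message length assumption matters, the following sketch compares the standard M/M/1 mean delay with the Pollaczek-Khinchine mean delay of an M/G/1 queue for a single isolated link. It is an illustration only, not the approximate network model referred to above: the function names are hypothetical, and the service time of 5.44 microseconds is an assumed example value (a 136-bit packet on a 25 Mbit/s link).

```python
def mm1_delay(service_time: float, rho: float) -> float:
    """Mean time in an M/M/1 queue (waiting plus service)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must satisfy 0 <= rho < 1")
    return service_time / (1.0 - rho)


def mg1_delay(service_time: float, rho: float, scv: float) -> float:
    """Mean time in an M/G/1 queue from the Pollaczek-Khinchine formula.

    scv is the squared coefficient of variation of the service time:
    1.0 for exponential service, 0.0 for fixed-length packets.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must satisfy 0 <= rho < 1")
    waiting = rho * service_time * (1.0 + scv) / (2.0 * (1.0 - rho))
    return service_time + waiting


if __name__ == "__main__":
    x = 5.44e-6  # assumed mean service time in seconds
    for rho in (0.1, 0.5, 0.9):
        print(rho, mm1_delay(x, rho), mg1_delay(x, rho, scv=0.0))
```

With scv = 1 the two functions agree, and with fixed-length packets (scv = 0) the queueing component of the delay is halved, so an exponential model overstates delay on a single link at a given utilization.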

The  remaining  assumptions  listed  above  are  appropriate  for  the  networks 
examined  here.  The  fourth  assumption,  unlimited  buffer  space,  will  be 
addressed  in  chapter  4.  It  will  be  seen  that  components  with  a  limited  number 
of  buffers  can  achieve  virtually  the  same  performance  as  components  with 
unlimited  buffering  capacity.  The  deterministic  routing  assumption  is  appropri¬ 
ate  because  packets  traveling  along  the  same  virtual  circuit  follow  the  same 
path  from  source  to  destination.  As  discussed  in  chapter  4,  this  is  necessary  to 
ensure  that  packets  sent  on  the  same  virtual  circuit  arrive  in  the  order  in  which 
they  were  sent,  thus  avoiding  much  of  the  overhead  associated  with  reassem¬ 
bling  messages  from  their  constituent  packets.  Since  communication  links  are 
short, electrical propagation delays are negligible (a few nanoseconds at most)
compared to the time required to transmit a single packet (hundreds or
thousands of nanoseconds). Finally, the assumption concerning error rates is
justified by the extremely low error rates measured in local communication
networks [Shoc80]. Since the communication system described here is confined to
an  even  smaller  physical  area  than  these  local  networks,  it  is  less  susceptible  to 
noise  in  the  operating  environment,  making  this  final  assumption  even  more 
appropriate. 

In  addition  to  the  “queueing  model  assumptions"  described  above,  it  is 
assumed  that  the  internal  structure  of  each  cluster  node  is  a  balanced  tree 
topology  (a  tree  with  minimal  average  path  length  between  the  root  and  leaf 
nodes  [Knut73]  ).  This  minimizes  the  average  hop  count  through  the  cluster 
node,  as  well  as  the  number  of  components  required  to  implement  a  node  with  a 
fixed  number  of  branches. 

Finally, in order to evaluate the performance of any communication network,
traffic distribution assumptions, i.e. which processors send messages to
which  other  processors  and  how  frequently,  must  be  made.  These  will  be 
explained  during  the  analysis  as  the  need  arises.  In  general,  these  assumptions 
are  made  to  simplify  the  analysis.  Simulations  using  a  wide  variety  of  traffic  dis¬ 
tributions  are  discussed  in  chapter  3. 

2.2.2.  Model I: Cluster Nodes

Consider the implementation of a network topology requiring b branches,
i.e. communication links, for each node. Each node could be implemented with
a single communication component requiring b + 1 ports, assuming one port is
used to connect to the computation processor attached to that node.
Alternatively, each node could be implemented with a "cluster" of p-port
communication components, where 3 ≤ p ≤ b. As discussed earlier, it will be
assumed that the components within each cluster node are interconnected by a
balanced tree topology. Figure 2.1, for example, shows a node with 4 branches
(b = 4) implemented with 3-port communication components called
"Y-components". This "cluster node" implementation implies a larger hop count
between processors; however, it also uses links of higher bandwidth, since fewer
ports are required on each VLSI chip.

Adding a p-port component to an already existing cluster node adds p - 2
branches. Since the one component cluster node has p - 1 branches, an n
component cluster node has (p - 1) + (p - 2)(n - 1) branches. Thus, a b-branch
cluster node uses

$$n = \mathrm{ceiling}\left(\frac{b-1}{p-2}\right)$$

components, where ceiling(z) is defined as the smallest integer greater than or
equal to z.

Figure 2.1. 4-branch node built from Y-components.
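
The component count can be checked numerically. The sketch below uses hypothetical function names and relies only on the branch count stated above, (p - 1) + (p - 2)(n - 1) branches for n components; it searches for the smallest n that provides at least b branches.

```python
def branches(p: int, n: int) -> int:
    """Branches offered by a cluster node built from n p-port components."""
    return (p - 1) + (p - 2) * (n - 1)


def components_needed(b: int, p: int) -> int:
    """Smallest number of p-port components giving at least b branches."""
    if p < 3:
        raise ValueError("each component needs at least 3 ports")
    if b < 1:
        raise ValueError("a node needs at least one branch")
    n = 1
    while branches(p, n) < b:
        n += 1
    return n


if __name__ == "__main__":
    for p in (3, 4, 5, 8):
        print(f"b = 20, p = {p}: {components_needed(20, p)} components")
```

For b of at least 2 the search agrees with the closed form ceiling((b - 1)/(p - 2)), which makes explicit how quickly the chip count grows as p shrinks.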

2.2.2.1.  Queueing Model

The queueing model presented in [Klei76] is used to evaluate the
performance of a b-branch "cluster node". In order to evaluate these models, traffic
distribution assumptions must be made. For the cluster node network, it is
assumed that there are two virtual circuits between every pair of branches in
the cluster node, one in each direction. In order to simplify the analysis, traffic
to and from the processors attached to the cluster node will be ignored, and
only traffic between branches will be considered. Since there are b branches,
there are b(b-1) virtual circuits through such a node. Assume that a traffic
load of l messages per second exists on each of these virtual circuits, and each
message consists of a single packet of data.

The average delay T through a store and forward communication network is
defined as:

$$T = \sum_{i,j} \frac{\gamma_{ij}}{\gamma} Z_{ij}$$

where $\gamma_{ij}$ is the average number of messages per second entering the
virtual circuit from branch i to branch j, while $\gamma$ is the total arrival
rate on all virtual circuits. $Z_{ij}$ is the average delay for messages along
the virtual circuit from i to j. It is assumed that $\gamma_{ij} = 0$ if
$i = j$. Since it is assumed that each of the b(b-1) virtual circuits has the
same external load, l messages per second, $\gamma = b(b-1)\,l$ and
$\gamma_{ij} = l$. Thus,

$$T = \frac{1}{b(b-1)} \sum_{i \neq j} Z_{ij} \qquad (1)$$

Let us now examine $Z_{ij}$.


Consider the path taken by the virtual circuit from branch X to branch Y,
as shown in figure 2.2. The average delay along this path is equal to

$$Z_{XY} = \sum_{i=1}^{n_l} T_i$$

where $T_i$ is the average delay at link i. Assume links are numbered
sequentially from 1 to $n_l$, as shown in figure 2.2. With cut-through,

$$T_i = \frac{m_l}{C(1-\rho_i)} - (1-\rho_{i+1})(t_m - t_h) \qquad (2)$$

as discussed below and in [Kerm79], where

    $m_l$ = average message length
    $C$ = capacity (bandwidth) of the links
    $\rho_i$ = utilization of link i
    $t_h$ = time to transmit message header over the link
    $t_m$ = time to transmit message over the link

Figure 2.2. Virtual circuit from X to Y.


The message transmission time, $t_m$, includes the time to send both the data
and header portions of the message. Assuming the total I/O bandwidth of each
p-port component is B bits per second, C is equal to B/p. The first term of
equation (2) is the solution of an M/M/1 queueing model, and represents the
amount of time required to obtain and transmit a message over the link. The
second term considers the effect of cut-through: $(1-\rho_{i+1})$ is the
probability that a cut-through occurs, and $t_m - t_h$ is the amount of time
"saved" by beginning to forward the message as soon as the header has arrived.
It is assumed that no "partial" cut-throughs occur, i.e. forwarding begins
either immediately after the header arrives or after the entire packet is
received.
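
The per-link delay described above can be sketched as follows. This is an illustration under stated assumptions: the function and parameter names are hypothetical, the delay is taken to be an M/M/1 term m/(C(1 - rho_i)) minus the cut-through saving (1 - rho_next)(t_m - t_h), and t_m and t_h are computed from the message and header lengths as length divided by link capacity.

```python
def link_delay(message_bits: float, header_bits: float, capacity_bps: float,
               rho_i: float, rho_next: float, cut_through: bool = True) -> float:
    """Average delay at one link, in seconds.

    rho_i    : utilization of this link (must be below 1)
    rho_next : utilization of the following link on the path
    """
    if not 0.0 <= rho_i < 1.0:
        raise ValueError("rho_i must satisfy 0 <= rho_i < 1")
    t_m = message_bits / capacity_bps   # time to transmit the whole message
    t_h = header_bits / capacity_bps    # time to transmit only the header
    delay = t_m / (1.0 - rho_i)         # M/M/1 queueing plus transmission
    if cut_through:
        # With probability (1 - rho_next) the next link is idle and the
        # message is forwarded as soon as its header has arrived.
        delay -= (1.0 - rho_next) * (t_m - t_h)
    return delay


if __name__ == "__main__":
    C = 100e6 / 4  # assumed: 100 Mbit/s chip bandwidth shared by 4 ports
    print(link_delay(136, 8, C, 0.0, 0.0))                     # idle, cut-through
    print(link_delay(136, 8, C, 0.0, 0.0, cut_through=False))  # idle, store and forward
```

On an idle path the cut-through delay at an intermediate link collapses to the header transmission time, whereas store and forward always pays the full message transmission time.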

Thus the delay through the virtual circuit from X to Y is

$$Z_{XY} = \sum_{i=1}^{n_l} \left[ \frac{m_l}{C(1-\rho_i)} - (1-\rho_{i+1})(t_m - t_h) \right]$$

$Z_{XY}$ measures the time from when the header of a message arrives at branch X
to when the entire message has been forwarded over branch Y.


The total traffic load on link k is equal to the number of virtual circuits
using the link, say $v_k$, times $l$, the load on each virtual circuit in
messages per second. Thus, link utilization is

$$\rho_k = \begin{cases} \dfrac{m_l \, l \, v_k \, p}{B} & \text{if } k \le n_l \\ 1 & \text{if } k = n_l + 1 \end{cases} \qquad (3)$$

Assigning $\rho_{n_l+1}$ to 1 forces the $(1-\rho_{i+1})(t_m - t_h)$ portion of
the last term in the summation for $Z_{XY}$ to be zero. This is necessary to
fulfill the definition for delay given above, i.e. the time which transpires
from when the head of the message enters the cluster node until the time at
which the end of the message leaves. The equation for $T_i$ measures the time
from when the head of the packet enters until the time at which the head begins
to leave. Thus we must also add the time which elapses until the end of the
packet leaves the cluster node. Setting $\rho_{n_l+1}$ to 1 accomplishes this
by, in effect, eliminating the "saved time" resulting from cut-through in the
final node.
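
The end-to-end computation can be sketched as below, assuming the per-link delay and utilization expressions described in this section. All names are hypothetical, the default packet and bandwidth values are the ones the text adopts for its numerical results (17-byte packets with a one-byte header, 100 Mbit/s per chip), and the list of virtual circuit counts per link is supplied by the caller.

```python
def path_delay(circuits_per_link, load_msgs_per_s, ports,
               total_bw_bps=100e6, message_bits=136, header_bits=8,
               cut_through=True):
    """Average delay (seconds) along a virtual circuit crossing the given links.

    circuits_per_link[k] is the number of virtual circuits sharing link k.
    """
    capacity = total_bw_bps / ports
    t_m = message_bits / capacity
    t_h = header_bits / capacity
    # Utilization of each link: traffic in bits per second over capacity.
    rho = [message_bits * load_msgs_per_s * v / capacity
           for v in circuits_per_link]
    if any(r >= 1.0 for r in rho):
        raise ValueError("a link on the path is saturated")
    # Fictitious utilization of 1 after the last link: no time is "saved"
    # there, so the delay runs until the end of the message has left.
    rho.append(1.0)
    total = 0.0
    for i in range(len(circuits_per_link)):
        total += t_m / (1.0 - rho[i])
        if cut_through:
            total -= (1.0 - rho[i + 1]) * (t_m - t_h)
    return total


if __name__ == "__main__":
    links = [1, 1, 1]  # hypothetical three-link path with no competing circuits
    print(path_delay(links, 0.0, ports=4))
    print(path_delay(links, 0.0, ports=4, cut_through=False))
```

For an idle path of n links the cut-through result reduces to (n - 1) header times plus one full message time, compared with n full message times for store and forward.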

Since $v_k$ can be easily computed for each link of a given cluster node, the
delay $Z_{ij}$ for each virtual circuit can be found. Once $Z_{ij}$ is known, equation (1)
can  be  used  to  compute  the  average  delay  among  all  virtual  circuits  using  the 
cluster  node.  Figure  2.3  shows  the  results  of  this  computation  for  a  20-branch 
(b  =  20)  cluster  node.  The  optimal  number  of  ports  as  a  function  of  b  will  be 
studied  in  a  later  section. 

The  various  curves  correspond  to  implementations  that  differ  in  two 
respects: 

(1)  the  number  of  ports  on  each  communication  component 

(2)  whether  or  not  a  cut-through  mechanism  is  used 

The "without cut-through" curves are obtained by deleting the
$(1-\rho_{i+1})(t_m - t_h)$ term in equation (2). Average delay is plotted in
figure 2.3 as a function of the external load applied to each virtual circuit.

The computations assume that the average packet length $m_l$ is 17 bytes,
consisting of 16 data bytes and a one byte header. The total I/O bandwidth of
each chip, B, is assumed to be 100 Mbits/second per chip, and is equally divided
among the existing links. This latter value was chosen arbitrarily but does not
affect the relative ordering of the curves. These numerical values will be used in
all subsequent computations unless indicated otherwise. From figure 2.3, it is
seen that network performance deteriorates as p is increased for this particular
cluster node.


To  a  first  order  approximation,  each  of  the  curves  in  figure  2.3  can  be 
represented  by  two  performance  measures: 

(1)  T*, the delay in a lightly loaded network.

(2)  l*, the maximum traffic load the network can support.

T* is the delay when l, the traffic load on each virtual circuit, is zero, and l* is
the asymptotic value for traffic load at which the delay approaches infinity. This
latter  quantity  reflects  the  point  at  which  some  link(s)  in  the  network  approach 
100%  utilization,  leading  (mathematically)  to  queues  which  become  infinitely 
long.  In  the  real  network,  a  flow  control  mechanism  limits  the  actual  queue  size 
on  each  link,  as  will  be  discussed  later.  We  will  now  examine  delay  and 
throughput in turn to determine the optimal number of ports for implementing
cluster nodes of any size.

Figure 2.3. Queueing delay for 20-branch cluster node. (Plot title: "Cluster
Node: Bandwidth and Delay (20 branches)"; vertical axis: delay in usec;
horizontal axis: load in Mbits/sec.)


2.2.2.2.  Delay

T*, the delay through a lightly loaded cluster node, is obtained by setting
the traffic load, l, or equivalently the link utilization, $\rho_i$, equal to 0
(except if $i = n_l + 1$, in which case $\rho_i = 1$). Thus, from equation (2),
the delay at each hop is

$$T_i^* = \begin{cases} \dfrac{p \, m_l}{B} - (t_m - t_h) & \text{if } i + 1 \le n_l \\ \dfrac{p \, m_l}{B} & \text{if } i + 1 = n_l + 1 \end{cases}$$


when  cut-through  is  used.  A  graph  of  T *  as  the  number  of  branches  increases  is 
shown  in  figure  2.4.  It  is  seen  that  cluster  nodes  implemented  using  components 
with  the  minimum  number  of  ports  yield  the  smallest  delay.  The  “bumps"  in 
figure  2.4  occur  when  a  new  component  is  added  to  the  cluster  node  as  the 
number  of  branches  is  increased.  This  leads  to  a  discontinuity  in  the  average 
hop  count,  which  in  turn  causes  a  discontinuity  in  the  delay. 

Without cut-through, the delay of each hop through a lightly loaded
b-branch cluster node implemented with p-port communication components is
simply $p \, m_l / B$. If H is the average number of hops through the cluster
node, then $T^* = H \, m_l \, p / B$. This function is also plotted in figure 2.4. It is seen that
the  optimal  number  of  ports  is  again  never  larger  than  4.  The  curves  also 
demonstrate  that  virtual  cut-through  can  significantly  improve  message  delays. 

Assuming the cluster node is implemented by a balanced (p-1)-ary tree, at
most $2 \log_{p-1} b$ hops are required. Thus, the delay through a cluster node
without cut-through is

$$T^* = \frac{2 \, m_l \, p \, \log_{p-1} b}{B}$$

Differentiating with respect to p and setting the result equal to 0 reveals that
minimum delay is achieved with approximately 4.6 ports per component,
agreeing with the curves in figure 2.4.

Figure 2.4. Delay through cluster node under light traffic loads.
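
The figure of roughly 4.6 ports can be reproduced numerically. The sketch below assumes the light-load bound discussed above, a delay proportional to p log_{p-1} b, whose derivative with respect to p vanishes where ln(p - 1) = p/(p - 1); the function names and the bisection bracket [3, 10] are illustrative choices.

```python
import math


def light_load_delay(p: float, b: int, message_bits: float = 136,
                     total_bw_bps: float = 100e6) -> float:
    """Upper bound on light-load delay without cut-through, in seconds."""
    hops = 2.0 * math.log(b) / math.log(p - 1)   # 2 * log base (p-1) of b
    return hops * message_bits * p / total_bw_bps


def optimal_ports(lo: float = 3.0, hi: float = 10.0, tol: float = 1e-9) -> float:
    """Root of ln(p-1) - p/(p-1) = 0, found by bisection (independent of b)."""
    def g(p: float) -> float:
        return math.log(p - 1) - p / (p - 1)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(f"continuous optimum: p = {optimal_ports():.2f}")
    best = min(range(3, 17), key=lambda p: light_load_delay(p, 20))
    print(f"best integer port count for b = 20: {best}")
```

The continuous optimum comes out near 4.59 regardless of b, and among integer port counts the bound is smallest at p = 5, with p = 4 within about one percent.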

Thus  we  see  that  the  optimal  number  of  ports  is  relatively  small  when  con¬ 
sidering  delay  through  lightly  loaded  networks.  Delay  through  a  lightly  loaded 
cluster  node  is  minimized  when  the  number  of  ports  is  between  3  and  5. 

2.2.2.3.  Bandwidth

L* is defined as the total network load when $\rho_i$ approaches 1 on the most
heavily utilized link in the cluster node. Since the links around the root of the
cluster node carry the most virtual circuits, they will saturate first. If the load
on each virtual circuit at saturation is l*, then L* is b(b-1)l*. If the most
heavily utilized link has bandwidth B/p and carries v virtual circuits, then
equation (3) suggests that saturation occurs at $l^* m_l = B/(v p)$ bits per
second on each virtual circuit, or

$$L^* m_l = \frac{b(b-1) \, B}{v \, p}$$

bits per second. A plot of $L^* m_l$ as a function of the number of branches is shown
in  figure  2.5a.  The  curves  indicate  that  cluster  nodes  constructed  with  the 
minimum  number  of  ports  yield  the  most  bandwidth. 

The irregular behavior of the curves is an artifact of the manner in which
branches are added to the cluster node, and does not represent a general
behavior of communication networks. It is best explained by examining the
individual components from which the curves are derived. For a given value of p,
the behavior of $L^*$ can be characterized qualitatively by the quantity b(b-1)/v,
i.e. the number of virtual circuits using the cluster node divided by the number
of circuits using the most heavily loaded link. Figures 2.5b and 2.5c show plots
of these two quantities as a function of b. For clarity, only curves for p equal to
3, 4, and 5 are shown in figure 2.5c. The remaining curves demonstrate a similar
behavior. It is seen that while the function b(b-1) yields a smooth curve, the
curve for v contains a number of discontinuities. These discontinuities give rise
to the peaks in figure 2.5a.

The location of the discontinuities in figure 2.5c is a consequence of the
manner in which components (i.e. branches) are added to the cluster node. The
number of branches is increased by adding components "from left-to-right" at
the leaves of the cluster node. Under this scheme, the most heavily utilized link
is always the "leftmost" link attached to the root of the cluster node (see
figure 2.5d). The number of virtual circuits using this link, v, is simply
$b_l(b - b_l)$, where $b_l$ is the number of branches in the leftmost subtree of the
root. If a new branch is added to the cluster node, one of two situations occurs:

(1)  The branch is added to the leftmost subtree, causing both b and $b_l$ to
increase by 1, and v to increase by $(b - b_l)$.

(2)  The branch is added somewhere other than the leftmost subtree, causing
only b to increase by 1, and v to increase by $b_l$.

Figure 2.5. (a) Bandwidth of cluster node, (b) Circuits in cluster node.

As branches are added to the cluster node, discontinuities occur when the
transition is made between these two situations, since the rate at which v is
increasing suddenly changes. This transition occurs when the leftmost subtree
becomes full, and when a new level is added to the cluster node. An exception to
this rule occurs for p equal to 3, where adding a new level does not cause a
discontinuity. This is a consequence of the symmetric nature of the binary tree.
When a new level is added, the number of branches in the left subtree $b_l$ is equal
to $b - b_l$, the number in the right subtree, so the rate at which v is increasing
remains the same, and the transition causes no discontinuity. Thus, in the
binary tree, discontinuities occur only when the left subtree becomes full, and
new branches begin filling the right subtree.

Each discontinuity results in a peak in the $L^*$ curve. As b is increased, $L^*$
increases if b(b-1) is growing faster than v, but falls if v is growing faster.
Each discontinuity represents a point at which the growth of v becomes
accelerated, causing $L^*$ to fall.
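
The two growth rules for v, and the ratio b(b-1)/v that shapes the $L^*$ curve, can be checked with a few lines of arithmetic. This is a minimal sketch assuming only the relation $v = b_l(b - b_l)$ stated above; the function names are illustrative.

```python
def circuits_on_left_link(b: int, b_l: int) -> int:
    """v: virtual circuits crossing the leftmost root link, v = b_l * (b - b_l)."""
    return b_l * (b - b_l)


def relative_bandwidth(b: int, b_l: int) -> float:
    """b(b-1)/v, the quantity that characterizes L* for a fixed p."""
    return b * (b - 1) / circuits_on_left_link(b, b_l)


if __name__ == "__main__":
    b, b_l = 10, 4
    v = circuits_on_left_link(b, b_l)
    # Case (1): new branch lands in the leftmost subtree.
    print(circuits_on_left_link(b + 1, b_l + 1) - v, "== b - b_l ==", b - b_l)
    # Case (2): new branch lands elsewhere.
    print(circuits_on_left_link(b + 1, b_l) - v, "== b_l ==", b_l)
    print(relative_bandwidth(b, b_l))
```

Because the two cases add different amounts to v, switching from one case to the other changes the slope of v abruptly, which is the source of the peaks described above.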

The curves in figure 2.5a indicate that the bandwidth provided by the
cluster node does not increase significantly as the number of branches, and thus the
number of components, increases. This is due partially to the fact that the
cluster node is implemented as a tree, and partially to the traffic model presented
above. The traffic model assumes that there is a virtual circuit between every
pair of branches, and that all of the virtual circuits are equally loaded. Thus, the
I/O bandwidth of the root node limits the total bandwidth of the cluster node;
increasing the number of components does not significantly increase the total
bandwidth provided by the cluster node.

When the root node links become congested, most of the links of the cluster
node, i.e. those near the leaves, are underutilized. Thus, virtual circuits which
only use these links, i.e. circuits which do not go through the root node, can
actually handle much more traffic. Let us consider the total bandwidth of the
cluster node when traffic on these "underutilized" virtual circuits is allowed to
increase. In particular, let us uniformly increase the traffic load on all virtual
circuits which do not go through the root node until more links begin to
saturate. The links "highest" in the cluster node tree will saturate first. Now
repeat this process, i.e. increase the load on all virtual circuits which do not use
saturated links, until all of the links are saturated. The total load on all of the
virtual circuits gives the maximum traffic load the cluster node will support.

With the traffic load just described, it is clear that all of the links of the
cluster node will be equally utilized. Such a network is said to be "balanced".
The bandwidth of a balanced network is equal to the sum of the bandwidths of all
of the communication links divided by the average hop count through the
network. Intuitively, each link adds some fixed amount of bandwidth to the network,
and each virtual circuit uses bandwidth proportional to the number of hops it
requires. Thus, this figure is indicative of the number of active (i.e. transmitting
data) virtual circuits the network can support at one time, or alternatively, it is
indicative of the total bandwidth allocated to a fixed set of virtual circuits. It
will be seen later that this intuitive measure of bandwidth can also be derived
from a queueing model for balanced networks.

A b-branch cluster node built from p-port communication components
provides bandwidth (see section 2.2.2):

for $p \ge 3$.

A graph of this measure of bandwidth for various values of p is shown in figure
2.6. Since a cluster node of n chips has a total link bandwidth which increases
linearly with n, and the hop count increases only logarithmically in n (assuming
a tree topology for the cluster node), one would expect the cluster node with the
most chips to provide the most bandwidth. This corresponds to cluster nodes
constructed with components using the minimum number of ports, or here, 3.
The graphs confirm this intuitive result. Note that virtual cut-through does not
impact the bandwidth provided by a network.

Figure 2.6. Maximum bandwidth (#chips / hop count) of cluster node.
Axes: bandwidth (Mbits/sec) versus number of branches.

When constructing multicomputer systems with cluster nodes, congestion
at the root can usually be alleviated through the use of an appropriate routing
algorithm. For example, figure 2.7 shows a grid topology implemented with
Y-components. An appropriate routing algorithm for this topology is to route
packets along one direction, say north/south, and then the other, east/west,
using only one "90 degree turn". With this scheme, each packet travels through
the root of a cluster node at most three times: at the source node, at the
destination node, and at the node in which the 90 degree turn is made. In general,
this type of behavior can be exploited for any topology (except trees, where
there is only one path between any pair of processors) by using a "global"
shortest path routing algorithm through the network to increase usage of the
shorter paths through the cluster node which do not go through the root.
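
The "one 90 degree turn" rule can be illustrated with a short sketch. It assumes grid nodes are addressed by hypothetical (row, column) pairs, which the report does not specify, and it only models the node-to-node path, not the internal structure of each cluster node.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (row, column); an assumed addressing scheme


def route(src: Node, dst: Node) -> List[Node]:
    """Travel north/south first, then east/west, visiting every node on the way."""
    row, col = src
    path = [(row, col)]
    step = 1 if dst[0] > row else -1
    while row != dst[0]:          # north/south leg
        row += step
        path.append((row, col))
    step = 1 if dst[1] > col else -1
    while col != dst[1]:          # east/west leg
        col += step
        path.append((row, col))
    return path


def count_turns(path: List[Node]) -> int:
    """Number of times the direction of travel changes along the path."""
    moves = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(path, path[1:])]
    return sum(1 for m, n in zip(moves, moves[1:]) if m != n)


if __name__ == "__main__":
    p = route((0, 0), (2, 3))
    print(p, "turns:", count_turns(p))
```

Every route produced this way has at most one turn, so at most three nodes on the path (source, turning node, destination) need to route the packet through their cluster-node root.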

Thus cluster nodes built with communication components with a small
number of ports, say from 3 to 5, yield the least delay, and cluster nodes built
with 3-port components yield the most bandwidth.

2.2.3.  Model II: Networks with a Fixed Number of Components

The  models  in  the  previous  section  demonstrated  that  higher  bandwidth 
and lower delays can be achieved by implementing b-branch cluster nodes with
communication  components  using  relatively  few  ports.  Such  networks  require 
more  chips  than  networks  constructed  from  components  with  a  larger  number 
of  ports.  In  this  section,  we  explore  the  tradeoff  between  hop  count  and  link 
bandwidth  for  networks  with  the  same  number  of  switching  components. 

Consider a large, unbounded network constructed from p-port
communication components. As before, assume that each port has a bandwidth
proportional to 1/p. It will be assumed that there is one processor attached to each
communication component in the network, using one of its p ports. Thus, in this
model, p must be at least 4, since a 3-port component can only implement a
ring topology.

Suppose that for some application, each node must communicate with all
nodes within R hops of it. Assume that there are M such nodes. A small value
for R, or equivalently M, indicates that traffic from each node is very localized,
while a larger value indicates more global communications. Consider one
specific node in the network, say X, and let us number the M nodes it sends
messages to: 1, 2, ..., M. The average distance (i.e. hop count) from X to these M
nodes is

$$H = \frac{1}{M}\sum_{i=1}^{M} h_i$$

where $h_i$ is the number of links traversed in the shortest path from X to i.
Assume that traffic from X is uniformly distributed among the M nodes it
communicates with. As before, increasing p will reduce the average distance, but at
the cost of slower links. Conversely, reducing p implies faster links, but longer
distances.


The average distance H is clearly dependent on the topology of the
network. In general, more redundancy (i.e. distinct paths between pairs of nodes)
implies a larger H, assuming constant p. For the purposes of this section,
different classes of network topologies will be characterized by a function, m(i)
(i = 1, 2, ..., k), with $M = \sum_{i=1}^{k} m(i)$ and m(i) equal to the number of nodes whose
minimum length path to node X is exactly i hops. The networks discussed here
are assumed to be symmetric and unbounded. Since each node has p-1 ports
for communicating with other nodes (one port leads to the processor attached
to that node), m(1) = p-1. Two abstract cases will be discussed here:

lattices:  $m(i) = m(i-1) + (p-1)$,        i = 2, ..., k
trees:     $m(i) = (p-2) \times m(i-1)$,   i = 2, ..., k

The first represents regular two-dimensional lattices (see figure 2.8) and the
second trees. Note that the latter case has no redundant links and thus gives
minimal H for any topology with p ports per node. It is thus a favorable
topology for components with a large number of ports, since networks using these
components depend on small hop counts to overcome the handicap of having
slower links. The lattice networks represent an alternative class of topologies
with less favorable hop count averages, but redundant paths between pairs of
nodes.
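
The recurrences for m(i) are enough to compute H numerically. The sketch below fills complete shells of m(1), m(2), ... nodes around X and places any leftover nodes in a partially filled outermost shell; that treatment of the last shell is an assumption, though with M = 50 it yields M H / (p-1) values of 65.00 (lattice, p = 4) and 57.33 (tree, p = 4), matching the circuits-per-link figures listed in table 2.1. Function names are illustrative.

```python
from typing import Iterator


def shell_sizes(p: int, topology: str) -> Iterator[int]:
    """Yield m(1), m(2), ... for the 'lattice' or 'tree' model with p ports."""
    if topology not in ("lattice", "tree"):
        raise ValueError("topology must be 'lattice' or 'tree'")
    m = p - 1                       # m(1) = p - 1
    while True:
        yield m
        m = m + (p - 1) if topology == "lattice" else m * (p - 2)


def average_hops(p: int, M: int, topology: str) -> float:
    """Average hop count H from X to its M nearest nodes."""
    remaining, hop_sum = M, 0
    for distance, m in enumerate(shell_sizes(p, topology), start=1):
        take = min(m, remaining)    # last shell may be only partly used
        hop_sum += distance * take
        remaining -= take
        if remaining == 0:
            break
    return hop_sum / M


if __name__ == "__main__":
    for p in range(4, 11):
        h_lat = average_hops(p, 50, "lattice")
        h_tree = average_hops(p, 50, "tree")
        print(p, round(50 * h_lat / (p - 1), 2), round(50 * h_tree / (p - 1), 2))
```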

2.2.3.1.  Queueing Model

The cut-through queueing model discussed earlier can also be applied to
the networks presented in this section. The symmetric nature of the traffic load
and the network topology leads to links which are equally loaded, i.e. the
network is balanced. As before, we will consider only traffic within the network
itself. Delays on the links between the processors and communication
components are ignored.


Figure 2.8. Two regular two-dimensional lattices. (a) p=4. (b) p=5.


A closed form solution for estimating network delay, including the effects of
virtual cut-through, is known [Kerm79]. Using the same assumptions discussed
in section 2.2.1, it can be shown that the average delay to send a message
through a balanced network is

$$T = \frac{H\,m_l\,p}{B(1-\rho)} - (H-1)(1-\rho)(t_m - t_h) \qquad (4)$$

where:

H      =  average hop count
$m_l$  =  average message length
B      =  total I/O bandwidth of each communication component
p      =  number of ports
$\rho$ =  utilization of each link
$t_h$  =  time to transmit message header over the link
$t_m$  =  time to transmit message over the link

The first term of this equation is the delay when no cut-through is used. The
second term is the improvement when cut-through is added. The effectiveness
of cut-through in reducing delay increases with H because there are more
chances for cut-through to occur if the number of hops required is large. The
dependence on $\rho$ arises from the fact that the probability that the outgoing link
is free, i.e. the probability that a cut-through will occur, depends on how heavily
the link is utilized. As before, the model assumes that no "partial" cut-throughs
occur, i.e. forwarding begins either immediately after the header arrives or
after the entire packet is received. The cut-through mechanism has greater
impact in lightly loaded networks (small $\rho$).

Consider a network with N processors (and thus N communication
components), with each processor sending messages to the M processors closest to
it. If H is the average hop count to reach another processor, then there are
N M virtual circuits, each using H links. Assume the load on each virtual
circuit is l messages per second, or $l\,m_l$ bits per second. Since the network has
N x (p-1) links (excluding the one connecting to the processor), the average load
on each link is $N M m_l\,l\,H / (N(p-1))$ bits per second. Therefore,

$$\rho = \frac{M\,m_l\,l\,H\,p}{B\,(p-1)} \qquad (5)$$

Since H can be computed numerically, given M and p, we can use equations (4)
and (5) to compute message delays.
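
A minimal sketch of how equations (4) and (5) combine is given below. It assumes the form of equation (4) in which the no-cut-through delay is $H m_l p / (B(1-\rho))$ and the cut-through saving is $(H-1)(1-\rho)(t_m - t_h)$, and it further assumes that $t_m$ and $t_h$ are the message and header lengths divided by the link bandwidth B/p. The parameter names (`header_bits`, `load`) are illustrative, not taken from the report.

```python
def link_utilization(M: int, H: float, m_l: float, load: float,
                     B: float, p: int) -> float:
    """Equation (5): rho = M * m_l * l * H * p / (B * (p - 1))."""
    return M * m_l * load * H * p / (B * (p - 1))


def message_delay(H: float, m_l: float, B: float, p: int, rho: float,
                  header_bits: float = 0.0, cut_through: bool = True) -> float:
    """Equation (4): average delay through a balanced network."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must satisfy 0 <= rho < 1")
    t_m = m_l * p / B              # assumed: message time on a B/p link
    t_h = header_bits * p / B      # assumed: header time on a B/p link
    delay = H * t_m / (1.0 - rho)  # store-and-forward term
    if cut_through:
        delay -= (H - 1.0) * (1.0 - rho) * (t_m - t_h)
    return delay


if __name__ == "__main__":
    # Illustrative numbers: 136-bit packets with an 8-bit header,
    # B = 100 Mbit/s, lattice with p = 4 and H = 3.9 (M = 50).
    rho = link_utilization(M=50, H=3.9, m_l=136, load=1000.0, B=100e6, p=4)
    print(rho)
    print(message_delay(3.9, 136, 100e6, 4, rho, header_bits=8))
    print(message_delay(3.9, 136, 100e6, 4, rho, cut_through=False))
```

Sweeping `load` from zero up to the value at which `link_utilization` reaches 1 traces out one delay-versus-load curve for a given p.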

Figure 2.9a shows delay in lattice topologies with and without a cut-through
mechanism as a function of the load applied to each virtual circuit. Table 2.1
lists the number of virtual circuits using each link. M is fixed at 50 nodes. The
optimal number of ports as a function of M will be studied in a later section.
Under light traffic loads, networks with a smaller number of ports achieve lower
delays, regardless of whether or not a cut-through mechanism is used. Figure
2.9a indicates, however, that the "knee" for curves with a large number of ports
is further to the right than that of those with a small number of ports. This
indicates that networks with a large number of ports can maintain reasonable
delays for larger traffic loads than networks with a small number of ports. In
other words, these curves indicate that components with a small number of links
yield networks with shorter delay, but less overall bandwidth.


Table 2.1. Link Usage

  p    Link Bandwidth    Circuits per Link    Circuits per Link
        (Mbits/sec)         (Lattices)             (Trees)

  4       25.00               65.00                 57.33
  5       20.00               42.50                 32.50
  6       16.67               30.00                 24.00
  7       14.29               23.33                 18.00
  8       12.50               18.57                 13.43
  9       11.11               15.00                 11.50
 10       10.00               12.67                 10.11

Figure 2.9b and table 2.1 present the same analysis for tree topology
networks, also with M fixed at 50 nodes. Again, networks with a small number of
ports yield better delay under light traffic loads, but poorer overall bandwidth.
The minimum number of ports achieves the least delay when a cut-through
mechanism is used, as would be expected since cut-through diminishes the
penalties of traversing extra hops. Networks without cut-through achieve
minimal delay when 5 ports are used, for this particular value of M.

We will now analyze the optimal number of ports as a function of traffic
locality, or here, M. As before, $T^*$, the delay in a lightly loaded network, and $l^*$,
the maximum virtual circuit traffic load supported by the network, will be
evaluated and compared.


2.2.3.2.  Delay

$T^*$, the delay in a lightly loaded network, is again found by setting $\rho$ equal to
0. Thus, from equation (4), one obtains:

$$T^* = \frac{m_l\,p\,H}{B} - (H-1)(t_m - t_h) \qquad \text{with cut-through}$$

$$T^* = \frac{m_l\,p\,H}{B} \qquad \text{without cut-through}$$


These quantities are plotted in figure 2.10 as a function of M, the number of
processors to which each processor sends messages (which determines H). When
cut-through is used, it is seen that networks constructed with the smallest
number of ports yield the least delay for both lattice and tree topologies. The
same is true for lattices without cut-through, indicating that the reduction in
hop count caused by increasing the number of ports is not enough to adequately
offset the lost bandwidth per port. The final case, tree topologies without
cut-through, is somewhat more complex.

In tree topologies without cut-through (figure 2.10b) it is seen that the
smallest number of ports (p=4) does not give minimum delay beyond M=32
nodes. Similarly, as M is increased further, larger values of p appear more
attractive (see figure 2.11), although the optimal number never rises beyond 6.
Given some value of M, the growth of m(i) (as i increases) determines the
average hop count, H. The faster m(i) grows, the smaller H becomes. In tree
networks, m(i) is an exponential function of p, implying its growth will be
accelerated substantially if p is increased. This acceleration is so substantial
that, to a certain extent, the associated reduction in hop count effectively
offsets the bandwidth loss which results when more ports are used.
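
The crossover near M = 32 can be reproduced from the discrete tree model. Without cut-through the light-load delay is proportional to p H, so the sketch below simply minimizes p times the sum of hop counts over candidate port counts. It assumes, as an illustration, that nodes beyond the last complete shell sit in a partially filled outermost shell; the function names are not from the report.

```python
def tree_hop_sum(p: int, M: int) -> int:
    """Sum of shortest-path hop counts from X to its M nearest nodes in the
    tree model: m(1) = p - 1, m(i) = (p - 2) * m(i - 1)."""
    total, remaining, shell, distance = 0, M, p - 1, 1
    while remaining > 0:
        take = min(shell, remaining)   # outermost shell may be partial
        total += distance * take
        remaining -= take
        shell *= p - 2
        distance += 1
    return total


def best_ports(M: int, candidates=range(4, 11)) -> int:
    """Port count minimizing p * H (delay without cut-through, light load)."""
    return min(candidates, key=lambda p: p * tree_hop_sum(p, M))


if __name__ == "__main__":
    for M in (20, 32, 33, 50):
        print(M, best_ports(M))
```

Under these assumptions p = 4 and p = 5 tie exactly at M = 32, and p = 5 wins for M = 50, in line with the observations above.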

For both classes of networks, these results favor a communication
component with relatively few ports, say from 4 to 6. A cut-through mechanism
makes the optimal number closer to 4. Under the conditions stated above, tree
topologies will always yield lower delays than lattices because of lower hop count
averages. This in turn results from the lack of redundant paths in tree
topologies and is in agreement with results already discovered by other researchers
[Desp78]. Asymptotic values of $m_l\,p\,H/B$ as a function of M will now be
derived, in order to determine an optimum number of ports when no
cut-through mechanism is provided.

Figure 2.10. Delay under light traffic loads, (a) lattices, (b) trees.

Figure 2.11. Delay under light traffic loads, trees (no cut-through).

Given such an abstract topology, we can treat m as a function of the
continuous variable r, the distance of a node from the other nodes to which it is
sending messages (previously, m(i) was a function of the discrete variable i).
As M grows toward infinity, m(r) is asymptotically equivalent to m(i). With this
perspective, H is no longer a sum, but rather an integral. Thus we have

$$H = \frac{1}{M}\int_0^R r\,m(r)\,dr \qquad \text{with} \qquad M = \int_0^R m(r)\,dr$$

And the two cases discussed above reduce to:

lattices:  $m(r) = (p-1)\,r$
trees:     $m(r) = (p-1)(p-2)^{r-1}$

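Evaluation of the above integrals for the two cases results in the following
equations for delay:

lattices:

$$T^* = \frac{m_l\,H\,p}{B} = \frac{m_l}{B}\,\frac{2p}{3}\sqrt{\frac{2M}{p-1}}$$

trees ($p \ge 4$):

$$T^* = \frac{m_l\,H\,p}{B} = \frac{m_l\,p}{B}\left[\frac{R\,(p-2)^R}{(p-2)^R - 1} - \frac{1}{\ln(p-2)}\right]$$

$$\text{with} \qquad R = \frac{\ln\left[\dfrac{M\,(p-2)\ln(p-2) + p - 1}{p-1}\right]}{\ln(p-2)}$$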


The equation for the first case again demonstrates that for any given M, a
lower delay results if fewer ports are used. The equation for the second case,
however, requires a more detailed analysis.

Minimizing H p by taking the derivative with respect to p, and solving this
equation numerically yields the curve in figure 2.12. This curve gives the
optimal number of ports (optimal in that it minimizes the average delay) as a
function of M, the number of nodes communicated with.


Figure 2.12. Optimal number of ports, tree topologies (no cut-through).

The above derivations assume that traffic from a node is uniformly
distributed among the nodes it communicates with. In practice, one would try to map
a specific problem onto the multicomputer in such a way that there is more
traffic with nearby nodes than with those which are further away. Traffic
between neighboring nodes should then be weighted more heavily. If one takes
this into account, the case for the use of a few high-bandwidth links rather than
many slower links becomes even stronger.

Thus,  based  on  these  studies,  it  appears  that  a  communication  component 
with  relatively  few  ports,  say  from  4  to  6,  is  the  most  desirable.  If  cut-through  is 
considered,  the  argument  for  a  small  number  of  ports  also  becomes  stronger. 


2.2.3.3.  Bandwidth


Let us consider increasing the load on all virtual circuits of the network. As
before, network bandwidth is defined as the asymptotic traffic load supported by
the network as it approaches saturation, i.e. as link utilization $\rho$ approaches 1.
From equation (5), the load per virtual circuit at saturation $l^*$ is

$$l^* = \frac{p-1}{p}\,\frac{B}{m_l\,M\,H}$$

messages per second. The total network bandwidth at saturation is

$$L^* m_l = \frac{p-1}{p}\,\frac{N\,B}{H}$$

bits per second, since there are M x N virtual circuits, where N is the number of
nodes in the network. Thus, neglecting the (p-1)/p term, the total bandwidth
of a network is approximated by the sum of the link bandwidths divided by the
average hop count, agreeing with the maximum bandwidth figure of merit
derived intuitively for the cluster node model. This figure is indicative of the
maximum number of active virtual circuits the network can support at one time.
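
The saturation figures follow directly from setting $\rho = 1$ in equation (5). The sketch below assumes that form of the equation; function and parameter names are illustrative. With B = 100 Mbits/sec, M = 50, p = 4 and a lattice hop count of H = 3.9, the per-circuit load equals the link bandwidth divided by the circuits per link from table 2.1 (25.00 / 65.00).

```python
def saturation_load_per_circuit(B: float, p: int, M: int, H: float) -> float:
    """l* m_l in bits per second: equation (5) solved for the load at rho = 1."""
    return B * (p - 1) / (M * H * p)


def network_bandwidth(N: int, B: float, p: int, H: float) -> float:
    """L* m_l in bits per second: ((p - 1) / p) * N * B / H."""
    return (p - 1) / p * N * B / H


if __name__ == "__main__":
    per_circuit = saturation_load_per_circuit(B=100.0, p=4, M=50, H=3.9)
    print(round(per_circuit, 4), "Mbits/sec per virtual circuit")
    # N * M circuits each carrying the saturation load give the total.
    print(network_bandwidth(N=1000, B=100.0, p=4, H=3.9))
```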


When comparing networks with the same number of chips, the sum of the
link bandwidths is constant, so the topology with the smallest average hop count
will achieve the highest bandwidth. Thus, in this case, networks constructed
with the largest number of links per node yield the most bandwidth. Bandwidths
for tree and lattice networks are shown in figure 2.13 for various values of p,
confirming this intuitive result. The curves also demonstrate how rapidly
network bandwidth diminishes as traffic becomes less localized.

2.2.4.  M/G/1 Queueing Models

The queueing models presented thus far have assumed that message
lengths are exponentially distributed. This allows one to use M/M/1 queueing
models which can be easily solved. Since the networks described here use fixed
length packets, an M/G/1 model is more appropriate. Unfortunately, the exact
solution of complex networks of M/G/1 queues is unknown, since Jackson's
theorem can no longer be applied. Briefly, Jackson's theorem allows one to
solve a network of queues with Markov arrival rates by examining each queue
independently, isolated from the rest of the network. Fixed length packets
imply non-exponential service times, which lead to non-Markovian behavior.

An alternative approach to resolving this dilemma is to use M/G/1 queues,
but to make the approximating assumption that Jackson's theorem can still be
applied. Other studies have indicated good correspondence between this model
and simulation results [Klei76]. The Pollaczek-Khinchin mean value formula
indicates that replacing an M/M/1 queue with one using fixed service times reduces
the waiting time (i.e. the time spent waiting for the link to become free) by a
factor of two [Klei75]. In the analysis presented thus far, this implies that
equation (2) for the cluster node model becomes

Ti  ~  T[C(1  +  ~C~  ~  ^  (tm  ~ 

while equation (4) for the second model becomes

$$T = \frac{H\,m_l\,p}{2}\left[\frac{1}{B(1-\rho)} + \frac{1}{B}\right] - (H-1)(1-\rho)(t_m - t_h).$$

Closer examination of these equations reveals, however, that delay in lightly
loaded networks (delay as $\rho$ approaches 0) and network bandwidth (traffic load
as $\rho$ approaches 1) are identical to that in the M/M/1 model. Thus, the M/G/1
queueing models yield curves with the same relative orderings as those derived
for the M/M/1 models.
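
The factor-of-two reduction in waiting time can be checked with the Pollaczek-Khinchin mean value formula, written here in terms of the squared coefficient of variation of the service time (1 for exponential service, 0 for fixed-length packets). This is a generic single-queue sketch, not the network model itself; the function name is illustrative.

```python
def pk_waiting_time(rho: float, mean_service: float, scv: float) -> float:
    """Mean time spent waiting in an M/G/1 queue (excluding service).

    rho          -- server utilization, 0 <= rho < 1
    mean_service -- mean service (transmission) time
    scv          -- squared coefficient of variation of the service time
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must satisfy 0 <= rho < 1")
    return rho * mean_service * (1.0 + scv) / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    for rho in (0.1, 0.5, 0.9):
        w_mm1 = pk_waiting_time(rho, 1.0, scv=1.0)   # exponential lengths
        w_md1 = pk_waiting_time(rho, 1.0, scv=0.0)   # fixed-length packets
        print(rho, w_mm1, w_md1, w_md1 / w_mm1)
```

The waiting time vanishes as the utilization goes to 0 in both cases, and both diverge as it approaches 1, which is why the light-load delay and the saturation bandwidth are unchanged.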

2.2.5.  Summary  of  Analytic  Results 

The analytic results for the optimal number of ports are summarized in
table 2.2 below.

Table 2.2.

model                          delay     bandwidth

cluster nodes                  small     small
fixed number of components     small     large

When considering delay, all of the analytical models presented here indicate that
better performance is achieved with communication components with a
relatively small number of ports, say from 3 to 6. Virtual cut-through reduces the
impact of larger hop counts, and thus pushes the optimal number of ports closer
to 3. It is seen that a cut-through mechanism can substantially reduce
transmission delays in the network, so it is unreasonable to exclude it from any
communication component design.

When considering bandwidth, the cluster node model favors components
with the minimum number of ports, while the "one communication component
per processor" model favors a large number of links. It is important to realize,
however, that the overall bandwidth of a network can be increased by adding
more chips, since the sum of the link bandwidths grows faster than average hop
count in most topologies (rings, which are generally considered to be unsuitable
for networks with a large number of processors, are an exception). This is
verified by the cluster node model, where networks constructed from
components with a small number of ports achieved greater bandwidth. Thus,
achieving low latency appears to be the more important problem, leading to further
support of communication components with a small number of ports.

It is important to remember that these bandwidth studies measure
maximum network bandwidth, and thus only consider performance under heavy
traffic loads. In a lightly loaded network, the bandwidth available to an individual
virtual circuit is equal to the bandwidth of the communication links it uses and
thus will be larger if components with a small number of ports are used.
Therefore, when combined with the analytic results presented in this section, one
must conclude that providing general purpose communication components with
a small number of ports, say from 3 to 5, is the best choice.

The analysis presented above made a number of simplifying assumptions.
The strongest assumption concerned the traffic distributions among processors.
Simulation studies which explore a number of different traffic distributions will
be discussed next. It will be seen that for the most part, these simulations
support the conclusions derived analytically. When discrepancies do occur, the
simulations indicate better performance for components with a small number of
ports, thus strengthening the conclusion that a small number of ports should be
used.

CHAPTER THREE
SIMULATION STUDIES

The analytical models presented earlier made some simplifying
assumptions. In particular, traffic distributions were assumed to be such that links are
equally utilized, message arrivals were assumed to follow a Poisson distribution,
and message lengths were assumed to follow an exponential distribution. To
evaluate the conclusions derived by the analytical models when these
assumptions are relaxed, and to gain deeper insight into the tradeoffs between various
network topologies and realizations of the communication components, a
simulation program was developed. The results of these simulation studies are
discussed in this section. An instruction level simulator called Simon is described,
and the respective speedups resulting from executing several parallel
application programs on various network structures are reported.

The first two sections describe the simulator and the assumptions made
about the multicomputer system. Following this, the application programs are
described, and simulation results are presented. Some of the issues evaluated
by this study include the optimal number of ports and the effect of
incorporating a mechanism in the communication hardware for efficiently handling
multiple-destination messages.

3.1.  The  Simulator:  Simon 

Simon (Simulator of Multicomputer Networks) is a discrete-time, event-driven
simulation program designed to facilitate comparison of alternate switching
structures [Fuji83]. The most important features of Simon are:

(1) Traffic in the communication domain is generated by application programs
executing  some  parallel  algorithm.  This  is  in  contrast  to  the  analytical  stu¬ 
dies  which  made  the  simplifying  assumptions  that  links  are  equally  loaded 
and  message  arrivals  follow  a  Poisson  distribution. 

(2)  The  software  modeling  the  interconnection  network  is  contained  in  a 
separate module called the "switch model", allowing easy comparison of
different  switching  structures. 

The  simulator  consists  of  three  components  (see  figure  3.1):  the  application  pro¬ 
gram,  the  simulator  base,  and  the  switch  model.  The  application  program  con¬ 
sists  of  a  number  of  tasks,  or  equivalently,  processes,  which  execute  in  parallel 
and  communicate  by  exchanging  messages.  The  simulator  base  time- 
multiplexes  execution  of  the  tasks  on  the  host  computer,  in  this  case  a  VAX- 
11/780.  The  base  also  keeps  track  of  time  for  each  task  (each  task  has  a  clock 
which  advances  as  the  task  executes)  to  ensure  that  interactions  among  tasks 
(e.g.  message  transmissions)  are  simulated  in  the  proper  time  sequence. 
Finally,  the  switch  model  provides  a  fixed  virtual  circuit  interface  for  the  tasks 
and  simulates  message  passing  between  processors.  A  detailed  description  of 
the simulator is given in [Fuji83].
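The time-keeping role of the simulator base can be illustrated with a minimal sketch. This is not Simon's implementation, which is described in [Fuji83]; the function `run_tasks`, the handler interface, and the fixed `delay` switch model are all assumptions made for illustration. Each task has its own clock, and messages are delivered in global arrival order so that interactions occur in the proper time sequence.

```python
import heapq

def run_tasks(tasks, delay):
    """Deliver messages in global time order (hypothetical interface).

    tasks: dict mapping task id -> handler(now, payload), which returns a
           list of (compute_time, destination, payload) tuples.
    delay: fixed end-to-end message delay of an assumed switch model.
    """
    clock = {tid: 0.0 for tid in tasks}   # one local clock per task
    events = []                           # heap of (arrival, seq, dest, payload)
    seq = 0                               # tie-breaker keeps heap entries comparable
    for tid in tasks:                     # start every task at time 0
        heapq.heappush(events, (0.0, seq, tid, None))
        seq += 1
    log = []
    while events:
        arrival, _, dest, payload = heapq.heappop(events)
        # A task handles a message no earlier than its arrival and no
        # earlier than the task itself becomes free.
        start = max(clock[dest], arrival)
        log.append((start, dest, payload))
        t = start
        for compute_time, target, out in tasks[dest](start, payload):
            t += compute_time             # the task's clock advances as it executes
            heapq.heappush(events, (t + delay, seq, target, out))
            seq += 1
        clock[dest] = t
    return clock, log

def producer(now, payload):
    # Computes for 5 time units, then sends one value to the consumer.
    return [(5.0, "consumer", 42)] if payload is None else []

def consumer(now, payload):
    return []

clock, log = run_tasks({"producer": producer, "consumer": consumer}, delay=2.0)
```

In this example the consumer's clock ends at 7.0: five units of computation in the producer plus a delay of two units in the switch.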

3.2.  Assumptions 

A  number  of  assumptions  are  made  in  the  simulation  experiments  reported 
here.  These  include: 

(1)  negligible  operating  system  overhead 

(2)  VAX  11/780  processing  elements 

(3)  one-to-one  mapping  of  tasks  to  processors 

(4) fixed length packets (1 byte header, 16 data bytes)

Figure 3.1. Block diagram of simulator.

(5)  unlimited  buffering  within  each  communication  component 

(6)  error  free  transmission 

(7)  virtual  cut-through 

(8) virtual circuits set up in advance

(9)  shortest  path  routing 

Each  of  these  assumptions  will  now  be  discussed  in  turn. 


(1) The bulk of the simulation studies assume that the time to execute an
operating system routine for invoking a communication mechanism (e.g.
sending a message) is negligible. This allows separation of the penalty due
to  operating  system  overhead  from  that  inherent  in  the  communication 
switch.  Studies  which  analyze  the  impact  of  operating  systems  overhead 
alone,  i.e.  which  assume  negligible  communication  delays,  will  also  be  dis¬ 
cussed. 

(2)  The  speed  of  the  processing  elements  is  fixed  throughout  the  simulations  to 
that  of  a  VAX-11/780.  By  the  mid  1980’s,  32-bit  microprocessors  will  have 
achieved  this  performance  level  [Patt82].  However,  the  absolute  speed  of 
the  processors  is  not  so  important  as  the  ratio  of  processor  speed  to  com¬ 
munication  delays,  since  this  affects  the  fraction  of  time  each  processor 
spends  performing  computations  relative  to  the  time  required  for  commun¬ 
ications.  As  technology  improves,  the  computational  speedup  due  to  the 
use  of  multiple  processors  is  unchanged  if  this  ratio  remains  the  same, 
since  both  uniprocessor  and  multicomputer  execution  times  decrease  by 
the  same  factor.  If  however,  processor  speed  increases  at  a  faster  rate 
than  communication  speed,  the  ratio  changes.  Communication  delays  will 
prevent  the  multicomputer  execution  time  from  decreasing  in  proportion 
to  that  of  the  uniprocessor,  and  speedup  actually  decreases.  Here,  since 
the  “VAX”  assumption  implies  a  constant  processor  speed,  this  ratio  is 
changed  by  varying  communication  bandwidth.  This  provides  the  flexibility 
of  determining  performance  under  1983  technology,  as  well  as  predicting 
the  effect  of  technological  changes. 

(3)  It  is  assumed  that  each  task  executes  on  a  separate  processor.  In  other 
words,  it  is  assumed  that  the  system  contains  enough  processors  to  accom¬ 
modate  the  application  program.  The  programs  studied  here  use  at  most 
32  processors,  so  this  is  a  reasonable  assumption.  Indeed,  general  purpose 
systems using more than 32 processors have already been constructed
[Swan77a, Stri83, Kush82, Hosh83].

(4)  Packets  consist  of  a  single  control  byte  followed  by  16  data  bytes.  Fixed- 
size  packets  are  used  because  of  the  difficulties  associated  with  managing 
variable  sized  buffers,  as  discussed  in  chapter  4.  This  is  in  contrast  to  the 
analytic  models  described  in  chapter  2  which  made  the  simplifying  assump¬ 
tion  that  message  lengths  follow  an  exponential  distribution.  The  control 
byte  is  used  to  specify  a  virtual  channel  number,  as  will  be  discussed  in 
chapter  4.  In  the  application  programs  discussed  here,  messages  are  short, 
typically consisting of only a single floating point number, and fit within a
single  packet.  An  area  of  future  research  is  to  consider  workloads  which 
include  large,  multi-packet  messages,  e.g.  paging  traffic  and/or  file 
transfers. 

(5)  It  is  assumed  that  adequate  buffer  space  is  available  in  each  component  for 
holding  packets  waiting  to  be  forwarded.  It  will  be  shown  later  that  chip 
densities  now  allow  each  component  to  provide  enough  buffer  space  to 
achieve  approximately  the  same  performance  as  a  component  with  an 
unlimited  amount  of  buffering. 

(6)  The  simulator  assumes  that  no  errors  occur  during  data  transmissions. 
This  assumption  was  also  used  in  the  analytical  models,  and  was  justified  in 
the  discussion  there. 

(7)  Virtual  cut-through  is  used  in  all  networks.  Partial  cut-throughs  are 
allowed,  i.e.  if  the  outgoing  link  used  by  a  packet  is  busy  when  the  header 
arrives,  but  becomes  free  before  the  tail  arrives,  the  packet  need  not  wait 
for  the  latter  event  before  it  begins  using  the  link.  The  analytical  results 
presented  earlier  indicated  that  substantial  improvements  can  be  achieved 
with  virtual  cut-through,  so  it  is  unreasonable  to  exclude  it  from  any  design. 


(8)  It  is  assumed  that  all  virtual  circuits  are  set  up  before  the  tasks  begin  exe¬ 
cution.  All  of  the  application  programs  studied  are  static  in  the  sense  that 
new  tasks  are  not  created  after  execution  begins.  Since  the  programs  exe¬ 
cute  for  long  periods  of  time,  the  set-up  time  is  negligible  relative  to  the 
total execution time. Thus, its effect on overall performance can be
neglected. 

(9)  Finally,  the  simulator  uses  a  shortest  path  routing  algorithm  to  set  up  its 
virtual  circuits.  Within  the  simulator,  Floyd’s  algorithm  [Floy62]  is  used  to 
perform  this  computation.  To  prevent  unfair  comparisons,  one  routing 
algorithm  was  used  throughout  all  of  the  simulation  studies.  Shortest  path 
routing  was  selected  because  it  has  a  simple  implementation  and  because  it 
has  some  prospect  of  achieving  good  performance  since  it  minimizes  the 
amount of network resources, i.e. bandwidth, required for each virtual
circuit. Evaluation of more sophisticated routing algorithms is a topic of
future  research. 
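The shortest path computation of assumption (9) can be sketched as follows. This is a minimal illustration of Floyd's all-pairs algorithm with next-hop tables, assuming unit-cost bidirectional links; the function names and the 6-node ring are hypothetical and are not taken from the simulator.

```python
INF = float("inf")

def floyd_next_hops(n, links):
    """All-pairs shortest paths (in hops) plus next-hop tables.

    links: iterable of (u, v) bidirectional links between nodes 0..n-1.
    """
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    nxt = [[j if i == j else None for j in range(n)] for i in range(n)]
    for u, v in links:
        dist[u][v] = dist[v][u] = 1
        nxt[u][v], nxt[v][u] = v, u
    for k in range(n):                    # allow node k as an intermediate
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt

def route(nxt, src, dst):
    """Node sequence followed by a virtual circuit from src to dst."""
    if nxt[src][dst] is None:
        return []                         # no path exists
    path = [src]
    while src != dst:
        src = nxt[src][dst]
        path.append(src)
    return path

# A 6-node ring, used here only as an example topology.
ring = [(i, (i + 1) % 6) for i in range(6)]
dist, nxt = floyd_next_hops(6, ring)
circuit = route(nxt, 0, 2)
```

Because the tables are computed once, every virtual circuit can be set up by table lookup before the tasks begin execution, in keeping with assumption (8).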

3.3.  The  Application  Programs 

Traffic  distributions  are  generated  by  application  programs  executing 
parallel  algorithms.  For  the  purposes  of  this  study,  an  application  program  is 
characterized  by  the  communication  pattern  it  generates.  In  particular,  com¬ 
munications  are  characterized  by  the  structure  of  communications  between  the 
program and its surrounding environment, and the pattern of communications
within  the  program,  i.e.  among  its  tasks. 

External  communications  between  the  parallel  program  and  its  environ¬ 
ment  are  assumed  to  fall  into  one  of  two  categories: 

(1) serial input, serial output (SISO).

(2) parallel input, parallel output (PIPO).

These two communication patterns are shown in figure 3.2. In SISO, the input
(output) data arrives from (is sent to) a single source (destination). In PIPO,
the data arrives (leaves) in parallel from (to) several sources (destinations).

Several of the application programs implement signal processing functions
which use an SISO communication pattern. A single processor samples the input
waveform and distributes the data values to a number of the other processors
which collectively compute results. Another processor collects the output
waveform. In other situations, a PIPO structure might arise. For example, the
application program could be one of several job steps, each of which is
implemented as a separate parallel program. Since the input (output) of each job
step comes from (goes to) another parallel program, one can expect data to
arrive (leave) in parallel. While other communication patterns are possible, e.g.
SIPO or PISO, these are only combinations of the patterns presented above, and
are not fundamentally different.

Figure 3.2. Communication patterns for application programs.

The  internal  communication  paths  are  also  partitioned  into  two  categories: 

(1)  global 

(2)  local 

As the names imply, global communication means that each task communicates
with all, or nearly all, of the other tasks. Local communication means that
each task communicates with a small subset of the other tasks. The programs
studied  here  that  use  local  communications  are  pipelined.  Thus,  the  communi¬ 
cations  are  local  in  the  sense  that  each  stage  of  the  pipeline  sends  messages 
only  to  the  next  stage,  and  not  to  previous  or  subsequent  stages.  Although  this 
communication  structure  does  not  exhibit  loops  among  tasks  in  different  stages 
of  the  pipeline,  loops  may  exist  among  tasks  within  the  same  stage.  Programs 
that  exhibit  loops  among  tasks  in  different  stages  are  considered  to  belong  to 
the  class  with  global  communications. 

Six  application  programs  demonstrating  several  different  traffic  patterns 
were  run  on  Simon.  Each  uses  one  of  the  four  combinations  of  the  parameters 
described  above.  These  are: 

(1) Barnwell, a signal processing program using Barnwell’s algorithm (global
SISO) 


(2)  Block  I/O,  a  signal  processing  program  using  block  filters  (local  SISO) 

(3)  Block  State,  a  second  program  also  using  block  filters  (local  SISO) 

(4)  FFT,  a  program  for  computing  Fast  Fourier  Transforms  (local  PIPO) 

(5)  LU,  a  program  for  performing  LU  decomposition  on  a  sparse  matrix  (global 
PIPO) 

(6)  Random,  a  program  generating  artificial  traffic  loads  (global  PIPO) 

The  communication  patterns  exhibited  by  these  programs  are  summarized  in 
table  3.1  below. 


Table 3.1. Communication structures used by the test programs.

          SISO                      PIPO
global    Barnwell (12 tasks)       Random (12 tasks)
                                    LU (15 tasks)
local     Block I/O (23 tasks)      FFT (32 tasks)
          Block State (20 tasks)

All  of  these  programs  communicate  relatively  small  amounts  of  data  fre¬ 
quently.  Typically,  a  task  waits  for  data  values  to  arrive  from  other  task(s),  per¬ 
forms  some  floating  point  operations  on  them,  and  then  generates  a  result  which 
is  passed  on  to  another  task(s).  The  number  of  processors  ranges  from  12  in  the 
Barnwell program to 32 in the FFT. Each of these programs will now be
discussed in greater detail.

3.3.1.  Barnwell  Filter  Program  (global  SISO,  12  tasks) 

The  Barnwell  filter,  and  the  two  programs  which  follow,  implement  the  digi¬ 
tal  filter  defined  by  the  equation: 

$$Y_n = \sum_{i=1}^{N} A_i Y_{n-i} + \sum_{i=0}^{M} B_i X_{n-i}$$

Vectors X and Y are the input and output waveforms, A and B characterize the
filter being implemented, and N and M are the number of poles and zeros in the
filter respectively. The programs presented here use M = N = 7.
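As a point of reference for the three parallel programs, the filter can be evaluated serially. The sketch below assumes the usual direct-form recurrence, in which each output is a weighted sum of the N previous outputs and the M+1 most recent inputs; the sign convention, the function name, and the one-pole coefficients in the example are assumptions, not values from this study.

```python
def direct_form_filter(x, a, b):
    """Evaluate y[n] = sum_{i=1..N} a[i]*y[n-i] + sum_{i=0..M} b[i]*x[n-i].

    a[0] is unused; samples before time 0 are taken to be zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(1, len(a)):        # feedback terms (poles)
            if n - i >= 0:
                acc += a[i] * y[n - i]
        for i in range(len(b)):           # feedforward terms (zeros)
            if n - i >= 0:
                acc += b[i] * x[n - i]
        y.append(acc)
    return y

# One-pole example with hypothetical coefficients: y[n] = 0.5*y[n-1] + x[n].
impulse = [1.0, 0.0, 0.0, 0.0]
response = direct_form_filter(impulse, a=[0.0, 0.5], b=[1.0])
```

Each output depends on the preceding outputs, which is the data dependency that the parallel algorithms must work around.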

An  "input  task"  distributes  a  total  of  400  samples  of  the  input  waveform 
(the  real  multicomputer  would  collect  this  data  from  a  sensor  at  some  sampling 
frequency)  to  some  number  of  "computation  tasks".  An  "output  task"  collects 
the output waveform computed by the computation tasks. Thus, all three
programs have an SISO communication pattern. When all of the 400 input samples
have  been  processed,  execution  terminates.  It  is  assumed  that  the  sampling 
frequency  is  large  compared  to  the  rate  at  which  data  points  can  be  processed. 
This  ensures  that  the  execution  time  is  not  limited  by  the  input  data  rate.  Thus, 
at  time  0,  the  input  processor  begins  distributing  the  400  data  points  to  the 
computation  processors  and  never  waits  for  input  data. 

The Barnwell program computes the filtering function using Barnwell's
algorithm [Barn78, Barn79, Hodg80, Barn82, Lu83]. The two signal processing
programs which follow use a different technique for performing the calculations.
In Barnwell, twelve tasks are used, as shown in figure 3.3. Each node in figure
3.3 represents a task, and each arc a virtual circuit. An arc which fans out to
several destinations represents a broadcast communication.

The  Barnwell  program  uses  ten  tasks  to  execute  the  signal  processing  algo¬ 
rithm.  This  is  the  maximum  number  of  processors  the  algorithm  can  effectively 
use  in  performing  the  computation,  assuming  small  communication  delays.  This 
number  is  a  function  of  the  number  of  poles  in  the  filter  being  implemented. 
Each  computation  processor  receives  40  input  samples. 

Each  data  point  received  by  a  computation  processor  is  combined  with  data 
generated  by  other  processors.  The  result  is  then  broadcast  to  the  six  proces¬ 
sors  immediately  "to  the  right"  of  that  processor.  These  communication  paths 
are shown in figure 3.3. The communication pattern for the Barnwell program is
classified as global SISO, although communications are really only
approximately global since each computation processor does not communicate with
all others.

Figure 3.3. Communication paths for Barnwell program.


3.3.2. Block I/O Filter Program (local SISO, 23 tasks)

The Block State and Block I/O programs perform the filtering function
described above by grouping the input samples into blocks and then processing
each block as a single unit. The resulting communication patterns are local
SISO.  These  algorithms  have  the  advantage  that  the  block  size  can  be  varied  to 
change  the  performance  of  the  system.  A  larger  block  size  requires  a  larger 
number  of  processors,  but  increases  the  rate  at  which  input  samples  can  be  pro¬ 
cessed.  Increasing  block  size  does  incur  a  latency  penalty  however.  The  amount 
of  time  between  reception  of  the  first  input  sample  and  the  generation  of  an  out¬ 
put  waveform  increases.  In  practice,  one  would  use  the  minimum  block  size 
which  allows  the  input  samples  to  be  processed  in  real  time;  this  minimizes  the 
latency  as  well  as  the  number  of  processors. 

Here,  the  Block  I/O  and  Block  State  programs  use  the  minimum  block  size 
in  order  to  minimize  latency.  This  minimum  size  is  related  to  the  number  of 
poles  in  the  filter.  Given  this  block  size,  the  computation  is  structured  to  use  as 
many  processors  as  required  to  exploit  the  parallelism  inherent  in  the  computa¬ 
tion.  The  Block  I/O  program  uses  23  tasks  which  are  structured  as  a  two-stage 
pipeline,  as  shown  in  figure  3.4.  Communications  within  the  second  stage  are 
global,  so  the  program  is  actually  somewhat  intermediate  between  local  and  glo¬ 
bal  SISO.  Note  that  input  samples  must  be  broadcast  to  several  other  tasks. 
Details  of  the  algorithms  implemented  by  this  program  can  be  found  in  [Burr71, 
Burr72, Mitr78, Lu83].

3.3.3. Block State Filter Program (local SISO, 20 tasks)

The Block State program uses the same "blocking" techniques discussed in
Block I/O. This program however, uses a somewhat different approach to perform
the computation, and as a result includes information about the internal
behavior of the filter as well as the input-output relationships. Thus, it allows
the determination of some intermediate values which the Block I/O program
does not compute. As before, the minimum block size is used, resulting in a
computation which requires 20 tasks. The communication paths for this
program are shown in figure 3.5. It is seen that the computation uses a 4 stage
pipeline, and thus exhibits a local SISO communication pattern. Again, input
samples are distributed via multiple-destination messages. Further details of
the algorithms used in the Block State program can be found in [Barn80a,

Figure 3.4. Communication paths for Block I/O program.

Figure 3.5. Communication paths for Block State program.

3.3.4.  FFT  Program  (local  PIPO,  32  tasks) 

This program performs complex 16-point Fast Fourier Transforms on sets
of input values. The FFT algorithm is used to compute the Fourier coefficients
for an analog signal. The input consists of 400 sets of complex input values,
$x_0 \ldots x_{15}$. The output consists of 400 sets of complex numbers
$y_0 \ldots y_{15}$ such that

$$y_i = \sum_{k=0}^{15} x_k \exp\left(-\frac{2\pi j}{16}\, ik\right), \qquad j = \sqrt{-1}$$

Details of the algorithm used to perform this computation in time proportional
to N log N (here, N = 16) are discussed in e.g. [Baas78].
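The difference between the defining sum and an N log N evaluation can be seen in a short sketch. The code below compares a direct evaluation of the transform with a standard recursive radix-2 decimation-in-time FFT. This is a textbook formulation chosen for illustration and is not necessarily the variant described in [Baas78] or used by the parallel program; the test input is arbitrary.

```python
import cmath

def dft(x):
    """Direct evaluation of y_i = sum_k x_k exp(-2*pi*j*i*k/N); O(N^2) work."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * i * k / n) for k in range(n))
            for i in range(n)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])    # two half-size transforms
    out = [0j] * n
    for i in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * i / n) * odd[i]   # twiddle factor
        out[i] = even[i] + t                  # butterfly: upper output
        out[i + n // 2] = even[i] - t         # butterfly: lower output
    return out

samples = [complex(k % 4, -k) for k in range(16)]   # arbitrary 16-point input
max_err = max(abs(p - q) for p, q in zip(dft(samples), fft(samples)))
```

For N = 16 the recursion has log2(16) = 4 levels of butterflies, and each level exchanges data only with the next.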


The communication paths used by this program are shown in figure 3.6.
Since the same computation is performed on several sets of input data, the
computation can be pipelined. The input data are assumed to reside in the
processors comprising the first stage of the pipeline, so the resulting
communication paths are local PIPO.

Figure 3.6. Communication paths for FFT program.


3.3.5. LU Decomposition (global PIPO, 15 tasks)

This program performs LU decomposition on a sparse matrix. LU decomposition
is a well known technique for solving a set of linear equations. Suppose a
set of equations is specified as

AX = Y

where A is a known n by n matrix, Y is a known column vector of length n, and
X is an unknown column vector also of length n. The solution to this equation
can be found by factoring the A matrix into two components, L and U, such that
A = LU, and then solving the equations

LB = Y and UX = B

in turn for B and then for X. L and U are lower and upper triangular matrices
respectively, i.e. all of the elements above (for L) or below (for U) the main
diagonal are 0, so these two equations can be easily solved by forward and
backward substitution respectively. If the equation AX = Y is solved many times
with different values for Y, then this method is more efficient than solving the
original equation (AX = Y) repeatedly by say, Gaussian elimination [Dahl74].

LU  decomposition  is  one  step  in  the  inner  loop  of  the  circuit  simulation  pro¬ 
gram  SPICE,  so  it  must  be  executed  repeatedly  on  each  circuit  simulation  run 
[Nage75].  The  parallel  program  used  in  these  experiments  performs  the  decom¬ 
position  by  using  Doolittle's  algorithm  [Chua75].  The  matrices  used  in  this  appli¬ 
cation  are  sparse,  but  not  necessarily  banded,  making  other  techniques,  e.g. 
systolic methods [Mead80], less attractive.
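A serial, dense version of the factorization and the two substitution steps is sketched below as a reference point. It assumes Doolittle's convention of a unit diagonal in L and performs no pivoting or sparse storage, so it illustrates the arithmetic only, not the parallel program generated for these experiments. The 3 by 3 matrix is a made-up example.

```python
def doolittle_lu(a):
    """Factor a (a list of rows) as L*U with unit-diagonal L; no pivoting."""
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):             # row i of U
            upper[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j] for k in range(i))
        for r in range(i + 1, n):         # column i of L
            partial = sum(lower[r][k] * upper[k][i] for k in range(i))
            lower[r][i] = (a[r][i] - partial) / upper[i][i]
    return lower, upper

def solve(lower, upper, y):
    """Solve L*B = Y by forward substitution, then U*X = B by back substitution."""
    n = len(y)
    b = [0.0] * n
    for i in range(n):
        b[i] = (y[i] - sum(lower[i][k] * b[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(upper[i][k] * x[k] for k in range(i + 1, n))) / upper[i][i]
    return x

A = [[4.0, 3.0, 0.0], [6.0, 3.0, 1.0], [0.0, 2.0, 5.0]]   # hypothetical matrix
L, U = doolittle_lu(A)
X = solve(L, U, [10.0, 15.0, 19.0])       # right-hand side chosen so X = (1, 2, 3)
```

Once L and U are known, each new right-hand side costs only the two substitution passes, which is why the factorization pays off when the same matrix is reused.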

Given  a  sparse  matrix,  the  parallel  program  was  generated  by  first  creating 
uniprocessor  code  for  performing  the  computation,  analyzing  the  data  depen¬ 
dencies  within  this  code,  and  then  creating  a  parallel  program  from  the  data 
dependency graph [Yu84, Wing80]. For the program in question, the
communication pattern which results from this process is global, i.e. every
task sends messages to every other task. The program is PIPO since LU
decomposition is only one of several parallel job steps in the inner loop for
SPICE. Input (output) values can be expected to arrive from (be sent to) another
parallel program executing the previous (subsequent) step of the inner loop.

3.3.6.  Artificial  Traffic  Loads  (global  PIPO,  12  tasks) 

A  program  creating  synthetic  traffic  loads  using  random  number  genera¬ 
tors  was  also  studied.  In  the  discussion  which  follows,  this  program  is  referred 
to as the "Random" program. In contrast to the other application programs,
this  program  does  not  perform  any  useful  computation.  Its  only  function  is  to 
generate  traffic  for  the  communication  network.  The  program  consists  of  12 
tasks,  each  of  which  sends  a  total  of  503  single-packet  messages.  Messages  are 
uniformly  distributed  among  other  tasks,  implying  global  communications. 
Since  each  processor  originates  its  own  messages,  in  contrast  to  a  single  proces¬ 
sor  generating  all  messages,  the  external  I/O  structure  is  PIPO.  The  mean  time 
between  messages  is  chosen  from  an  exponential  distribution.  Loading  on  the 
network  is  increased  by  reducing  the  average  time  between  messages. 
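A traffic generator of this kind can be sketched in a few lines. The version below assumes that each task draws its inter-message times independently from an exponential distribution and picks destinations uniformly among the other tasks; the function name, the fixed seed, and the mean gap of 100 time units are illustrative assumptions.

```python
import random

def synthetic_traffic(num_tasks, messages_per_task, mean_gap, seed=0):
    """Return a time-ordered list of (send_time, source, destination)."""
    rng = random.Random(seed)             # fixed seed for repeatable runs
    schedule = []
    for src in range(num_tasks):
        now = 0.0
        others = [t for t in range(num_tasks) if t != src]
        for _ in range(messages_per_task):
            now += rng.expovariate(1.0 / mean_gap)   # exponential gap, mean = mean_gap
            schedule.append((now, src, rng.choice(others)))
    schedule.sort()
    return schedule

# 12 tasks sending 503 messages each, as in the Random program.
trace = synthetic_traffic(num_tasks=12, messages_per_task=503, mean_gap=100.0)
```

Reducing `mean_gap` raises the offered load without changing the number of messages sent.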

3.4.  Communication  Delays 

Figure  3.7a  shows  the  performance  of  these  application  programs  using  a 
fixed-delay,  infinite-bandwidth  switch.  Speedup,  which  is  defined  as  the  execu¬ 
tion  time  of  the  program  on  a  uniprocessor  divided  by  the  execution  time  on  the 
multicomputer,  is  plotted  as  a  function  of  communication  delay.  Here,  delay 
refers  to  the  end-to-end  delay  to  send  a  message  along  a  virtual  circuit.  It  is 
assumed  that  this  delay  is  the  same  along  all  circuits.  Since  the  switch  provides 
unlimited bandwidth, any number of processors may simultaneously send mes-

Figure 3.7. (c) Speedup vs. operating system overhead.


The two programs using global communications, Barnwell and LU, experience
a severe degradation in performance as communication delays increase.
This  results  from  the  relatively  fine  "granularity"  of  the  computation,  in  which 
communications  are  frequent  and  delays  have  a  significant  impact  on  total  exe¬ 
cution  time.  On  the  other  hand,  the  FFT  and  Block  State  programs  exhibit  little 
performance  degradation  as  delays  increase.  These  programs  are  pipelined,  so 
delays  only  affect  the  amount  of  time  required  to  fill  and  empty  the  pipe.  Once 
the  pipeline  is  filled,  data  arrives  at  each  processor  at  a  constant  rate,  indepen¬ 
dent  of  communication  delay,  so  all  of  the  processors  remain  busy.  It  is  errone¬ 
ous  however,  to  conclude  that  the  interconnection  switch  does  not  impact  the 
performance  of  these  programs,  since  the  curves  in  figure  3.7a  assume  unlim¬ 
ited  network  bandwidth. 
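The insensitivity of a pipelined program to delay can be checked with an idealized model. The sketch below assumes a linear pipeline in which every stage takes the same compute time per item, links have unlimited bandwidth, and each hop adds a fixed delay; the stage count, item count, and times are hypothetical and are not measurements from figure 3.7a.

```python
def pipeline_speedup(stages, items, compute, delay):
    """Speedup of an idealized linear pipeline over a uniprocessor."""
    finish = [0.0] * stages               # when each stage finished its latest item
    for _ in range(items):
        ready = 0.0                       # all input is available at time 0
        for s in range(stages):
            start = max(ready, finish[s]) # wait for the data and for the stage
            finish[s] = start + compute
            ready = finish[s] + delay     # arrival time at the next stage
    uniprocessor = stages * items * compute
    return uniprocessor / finish[-1]

no_delay = pipeline_speedup(stages=4, items=400, compute=100.0, delay=0.0)
long_delay = pipeline_speedup(stages=4, items=400, compute=100.0, delay=1000.0)
```

In this model the delay is paid once per stage boundary while the pipe fills, not once per item, so a delay ten times the stage time lowers the speedup only from about 3.97 to about 3.70.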

The curves in figure 3.7b show the performance of the programs as a function
of network bandwidth. Conceptually, the network can be viewed as an entity
which provides a certain amount of bandwidth (this quantity is plotted on the
horizontal  axis  in  figure  3.7b)  for  transmitting  messages.  The  optimistic 
assumption  is  made  that  all  of  the  network’s  bandwidth  can  be  allocated  to  a 
single  virtual  circuit  on  demand.  In  the  simulator,  this  is  implemented  by  using 
an  "ideal  bus"  switch  model.  The  communication  network  consists  of  a  single 
bus  of  the  indicated  bandwidth.  The  full  bandwidth  of  the  bus  is  allocated  to 
messages as they are generated. Conflicting requests for the bus are queued in FIFO
sequence,  and  propagation  delays  along  the  bus  are  assumed  to  be  zero.  The 
curves  indicate  that  although  the  performance  of  the  pipelined  programs  is 
insensitive  to  communication  delay,  adequate  network  bandwidth  is  required  to 
achieve  good  performance.  The  programs  exhibiting  global  communication  pat¬ 
terns  behave  similarly.  The  LU  program  in  particular,  is  seen  to  require  very 
large  amounts  of  bandwidth  before  achieving  good  performance.  Simulations  at 
higher  bandwidths  indicate  that  a  500  Mbit/second  network  is  required  to 
achieve  a  speedup  of  10.0  (speedup  with  an  infinite  bandwidth,  zero  delay  switch 
is  12.7). 
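The "ideal bus" switch model can be approximated by a single FIFO server. The sketch below assumes 17-byte packets (one header byte plus 16 data bytes, as in assumption (4)), zero propagation delay, and service in order of request time; the function name and the 1.36 Mbit/second bandwidth are hypothetical, the latter chosen so that one packet occupies the bus for 100 microseconds.

```python
def ideal_bus(requests, bandwidth_bps, packet_bytes=17):
    """Return the delivery time of each packet on a single shared bus.

    requests: list of (ready_time_seconds, label) pairs.
    The bus carries one packet at a time at its full bandwidth.
    """
    service = packet_bytes * 8 / bandwidth_bps   # seconds on the bus per packet
    bus_free = 0.0
    delivered = {}
    for ready, label in sorted(requests):        # FIFO by request time
        start = max(ready, bus_free)             # queue if the bus is busy
        bus_free = start + service
        delivered[label] = bus_free
    return delivered

# Three packets offered at the same instant.
times = ideal_bus([(0.0, "a"), (0.0, "b"), (0.0, "c")], bandwidth_bps=1.36e6)
```

The third packet is delivered after roughly 300 microseconds, which shows how queueing on the bus turns limited bandwidth into delay even when propagation is instantaneous.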

Finally,  the  curves  in  figure  3.7c  indicate  performance  as  a  function  of 
operating  system  overhead.  Here,  overhead  is  measured  as  the  time  required  to 
execute  an  operating  system  routine  for  sending  or  receiving  a  message. 
Transmission  delays  are  assumed  to  be  zero.  It  is  seen  that  degradation  is 
severe  when  delays  in  the  operating  system  are  only  a  few  tens  of  microseconds. 
This  result  again  is  a  consequence  of  the  relatively  fine  granularity  of  the  com¬ 
putation.  It  points  out  that  hardware  support  for  operating  system  primitives 
(here  sending  and  receiving  messages)  is  required  to  allow  full  exploitation  of 
the parallelism inherent in many programs. With a traditional software
implementation, the time spent in the operating system will dominate the
transmission time, negating the benefits of incorporating a high-performance
communication network. In particular, since recovery from transmission errors is left to
an  end-to-end  protocol,  hardware  support  should  be  employed  in  the  computa¬ 
tion  processor  to  keep  these  checks  from  degrading  performance.  Hardware 
support  for  communication  primitives  thus  represents  an  important  area  of 
future  research. 

3.5.  Issues  Under  Investigation 

Four  separate  issues  are  studied  in  these  simulation  experiments.  The  first 
explores  the  optimal  number  of  ports,  and  compares  simulation  results  with 
those  predicted  by  the  analytical  models  presented  earlier.  Next,  an  alternative 
model  in  which  processor  and  communications  are  integrated  onto  the  same 
chip  is  studied.  Third,  since  many  of  the  application  programs  send  the  same 
message  to  several  different  destinations,  the  impact  of  incorporating  a 
mechanism  which  efficiently  handles  such  messages,  i.e.  a  multicast  mechan¬ 
ism,  is  investigated.  Finally,  the  particular  mapping  of  tasks  to  processors 
which  was  used  in  these  experiments  is  examined,  as  well  as  its  impact  on  the 
simulation  results. 

In  order  to  evaluate  the  optimal  number  of  ports,  two  types  of  switch 
models  were  implemented:  cluster  nodes  and  networks  with  a  fixed  number  of 
components.  These  switch  models  correspond  to  the  networks  discussed  in  the 
analytical studies presented earlier. In the first, each node of a topology requiring
b branches per node is implemented with a cluster of p-port communication
components.  As  p  is  reduced,  the  number  of  components  required  to  construct 
the  network  is  increased.  Thus,  the  cluster  node  switch  models  do  not  keep  the 
chip  count  constant.  The  second  set  of  switch  models  compares  networks  with 
different  values  of  p,  but  with  approximately  the  same  number  of  components. 

In addition to networks constructed from separate computation and
communication components, networks with processor and communications
integrated onto the same chip are studied. This is the building block for the
"network computer" proposed by Wittie [Witt81]. In this model, the
communication links between the computation and communication domains are eliminated.
In  communication  component  networks,  it  will  be  seen  that  these  links  some¬ 
times  become  bottlenecks  which  bias  the  results.  The  simulations  under  this 
latter  model  eliminate  this  bias.  Multicomputers  using  the  Wittie  model  do 
require  more  circuitry  per  chip  than  those  using  communication  components, 
making  direct  comparisons  unfair.  Nevertheless,  it  is  included  as  an  alternative 
model  for  multicomputer  networks. 

Since  the  digital  filtering  algorithms  (Barnwell,  Block  State,  and  Block  I/O) 
involve  transmitting  the  same  data  to  several  destinations,  a  mechanism  which 
efficiently distributes multiple-destination packets (i.e. a "multicast"
mechanism) is expected to improve performance.

If a multicast mechanism is not used, several "single destination" packets
are  generated  at  the  source  node,  one  for  each  destination,  and  each  is  routed 
separately  through  the  network  to  its  particular  destination  using  a  shortest 
path  routing  algorithm.  If  one  traces  the  paths  followed  by  these  packets 
through  the  network,  it  is  seen  that  packets  will  follow  each  other  up  to  a  cer¬ 
tain  point,  at  which  time  they  part  and  go  their  separate  ways.  The  multicast 
mechanism combines the single destination packets which are "following each
other" into a single "multicast packet". A new copy is not generated until one
or  more  of  the  single  destination  packets  incorporated  into  the  multicast  packet 
need  to  "go  their  separate  ways".  If  several  packets  breaking  off  like  this  are  all 
going  in  the  same  direction,  only  one  new  multicast  packet  is  created.  Multicast 
and broadcast mechanisms are described more fully in [Dala78, Bhar83,
McQu78]. Note that since virtual circuits are used, implementation of this does
not  affect  other  parameters  of  the  switching  network.  A  longer  header  might  be 
needed to provide a list of destination nodes. However, in a virtual circuit
mechanism,  this  information  need  only  be  carried  through  the  network  when  the 
multicast  circuit  is  set  up. 
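
The saving from combining packets that "follow each other" can be illustrated with a minimal sketch. It assumes breadth-first search as the shortest path routing algorithm and a small bidirectional ring as the example network; the function names are hypothetical and are not taken from the Simon simulator.

```python
from collections import deque

def shortest_path_tree(adj, source):
    """Breadth-first search from source; maps each node to its predecessor."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent

def path_links(parent, dest):
    """Directed links on the routed path from the source to dest."""
    links = []
    node = dest
    while parent[node] is not None:
        links.append((parent[node], node))
        node = parent[node]
    return links[::-1]

def link_transmissions(adj, source, dests):
    """Return (unicast, multicast) counts of link traversals for one message.

    Unicast sends one copy per destination over its whole path. Multicast
    sends one copy over each distinct link of the merged paths.
    """
    parent = shortest_path_tree(adj, source)
    unicast = sum(len(path_links(parent, d)) for d in dests)
    multicast = len({link for d in dests for link in path_links(parent, d)})
    return unicast, multicast

# Assumed example: 6-node bidirectional ring, node 0 sends to nodes 1 and 2.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(link_transmissions(ring, 0, [1, 2]))
```

In this example both packets share the link from node 0 to node 1, so the multicast version crosses that link once instead of twice.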

The  mapping  of  the  application  program  onto  the  network  topology  is 
identified by labels assigned to tasks and processors. As shown in figures 3.3-3.6
(the  remaining  two  programs,  Random  and  LU,  use  global  communications,  so 
the  mapping  does  not  influence  the  results),  each  task  of  each  application  pro¬ 
gram  is  characterized  by  a  unique  integer  called  its  "task  id".  Similarly,  each 
node,  i.e.  processor,  of  a  topology  is  characterized  by  a  unique  node  number.  In 
the  discussions  which  follow,  task  i  always  executes  on  processor  i.  Thus,  the 
simulation  results  assume  a  specific  mapping  of  tasks  to  processors.  Care  must 
be  taken  to  ensure  that  this  mapping  does  not  bias  the  results.  More  will  be  said 
about  this  later. 


3.6.  Simulation Results on Cluster Node Networks

As  discussed  earlier,  one  can  implement  a  node  of  a  topology  requiring  b 
branches  per  node  as  a  cluster  of  p-port  communication  components.  The  vari¬ 
ous  application  programs  described  above  were  run  on  Simon  using  switch 
models  for  several  different  cluster  node  networks.  The  results  of  these  simula¬ 
tion  experiments  are  reported  in  this  section. 

For this study, four topologies are examined which vary b over a wide range
of values. All topologies are assumed to use full duplex, bidirectional links.
These  topologies  are: 

(1)  Fully  connected  network. 

(2)  Full-ring binary tree [Desp78].

(3)  Butterfly  network. 

(4)  Ring network.

The  topology  within  each  cluster  node  is  a  balanced  tree,  with  the  processor 
attached  to  the  communication  component  at  the  root. 

In all of the graphs which follow, speedup is plotted as a function of B, the
total  I/O  bandwidth  of  the  communication  chip.  The  only  exception  is  the 
artificial  traffic  load  program  in  which  average  message  delay  is  plotted  as  a 
function  of  traffic  load.  It  is  assumed  that  the  bandwidth  B  is  equally  divided 
among  the  existing  communication  links.  Thus,  a  Y-component  with  B  equal  to 
300  Mbit/second  has  three  100  Mbit/second  communication  links.  For  com¬ 
parison,  the  speedup  on  a  multicomputer  with  an  infinite-bandwidth,  zero-delay 
interconnection  system  (i.e.  a  "perfect  switch")  is  also  shown.  The  perfect 
switch  assumes  that  messages  arrive  at  their  destination  at  the  instant  at  which 
they  are  sent.  It  thus  gives  an  upper  bound  on  performance  for  any  communica¬ 
tion  network. 
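
The equal division of chip bandwidth among links can be written out as a small sketch. The function names and the 1000-bit packet size are assumptions chosen for illustration; only the 300 Mbit/second Y-component figure comes from the text.

```python
def link_bandwidth(chip_bandwidth_mbit, ports):
    """Bandwidth of one link when chip bandwidth B is split equally over p ports."""
    if ports <= 0:
        raise ValueError("a component needs at least one port")
    return chip_bandwidth_mbit / ports

def transfer_time_us(packet_bits, chip_bandwidth_mbit, ports):
    """Time in microseconds to push one packet over one link.

    1 Mbit/second equals 1 bit/microsecond, so no unit conversion is needed.
    """
    return packet_bits / link_bandwidth(chip_bandwidth_mbit, ports)

# Y-component (3 ports) with B = 300 Mbit/second, as in the text.
print(link_bandwidth(300, 3))          # 100.0 Mbit/second per link
# Assumed 1000-bit packet on components with more ports at the same B.
for ports in (3, 4, 5, 6):
    print(ports, transfer_time_us(1000, 300, ports))
```

At fixed B, each added port lowers the speed of every link, which is the cost that must be weighed against any reduction in hop count.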

The  analytical  results  presented  earlier  indicated  that  cluster  node  net¬ 
works  constructed  from  communication  components  with  a  small  number  of 
ports  yielded  the  most  bandwidth  and  least  delay.  Thus,  one  would  expect  net¬ 
works  constructed  from  Y-components  to  yield  the  best  performance.  It  will  be 
seen  that  the  simulation  results  confirm  this  conclusion. 

3.6.1.  Fully  Connected  Networks 

The  fully  connected  network  is  formed  by  placing  a  single  link  between 
every  pair  of  nodes.  Here,  the  number  of  nodes  is  equal  to  the  number  of  tasks 
required  by  the  parallel  program,  and  it  thus  varies  from  application  to  applica¬ 
tion.  This  topology  minimizes  the  number  of  hops  between  every  pair  of  nodes, 
but  at  the  expense  of  a  larger  number  of  branches  on  each  node. 


Three  of  the  application  programs  were  run  on  Simon  with  switch  models 
for  fully  connected  networks.  Performance  curves  are  shown  in  figures  3.8a-c. 
Due  to  limited  amounts  of  computing  resources,  cluster  node  simulations  for  the 
FFT, Block I/O, and Block State programs are not available. Figures 3.8a-c indicate
that performance improves as the number of ports is reduced, in agreement
with the analytical results. Curves labelled "P+C" indicate that processor
and  communications  circuitry  are  incorporated  onto  the  same  chip. 

The  curves  resulting  from  the  artificial  traffic  load  program  indicate  that 
reducing the number of ports reduces the average delay in a lightly loaded network,
and increases total network bandwidth. The bandwidth result, however, is
somewhat  misleading  because  the  total  network  bandwidth  shown  in  figure  3.8c 
is  limited  by  the  link  between  the  processor  and  its  communication  component. 
This  is  demonstrated  by  the  curve  in  which  communication  circuitry  is  included 
in  the  same  chip  as  the  processor.  Network  bandwidth  is  increased  significantly 
when  this  bottleneck  is  removed. 

Figure 3.8d shows the curves for the artificial traffic load program with this
bottleneck  link  removed.  Here,  it  is  assumed  that  the  root  component  of  each 
cluster  node  has  both  computing  and  switching  capabilities.  Other  components 
only  perform  switching  functions.  As  expected,  delay  in  lightly  loaded  networks 
improves  as  the  number  of  ports  is  reduced.  The  curves  also  indicate,  however, 
that  networks  with  a  large  number  of  ports  provide  as  much  bandwidth  as  those 
using  Y-components.  This  unexpected  result  is  a  consequence  of  the  lack  of 
store-and-forward  communications  in  the  fully  connected  network,  and  the  par¬ 
ticular  traffic  distribution  created  by  the  artificial  traffic  load  generator. 

The  bandwidths  indicated  by  figure  3.8d  represent  the  minimum  of  two 
quantities: 

(1)  The maximum bandwidth provided by the network.

(2)  The  maximum  rate  at  which  traffic  can  be  generated  by  the  processors. 

If  the  first  quantity  is  the  limiting  factor,  then  the  tradeoff  between  hop  count 
and  link  bandwidth  discussed  earlier  determines  the  optimal  number  of  ports.  If 
the  second  quantity  limits  performance,  then  the  utilization  of  the  links  around 
the  processors  generating  messages  determines  performance.  The  more 
efficiently  these  links  are  used,  the  greater  the  amount  of  traffic  sent  into  the 
network,  and  the  higher  the  overall  bandwidth.  This  quantity  is  maximized  if  the 
traffic  generated  by  each  processor  is  evenly  distributed  across  that  processor's 
output  links,  since  this  implies  that  on  the  average,  all  of  the  processor's  links 
will  be  busy  all  of  the  time.  An  uneven  distribution  causes  some  links  to  be  over¬ 
loaded  while  others  become  idle,  reducing  the  total  traffic  flow  into  the  network. 
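
The effect of an uneven traffic split can be sketched with a simple steady-state model. It assumes that a processor can only sustain a generation rate at which no single output link is asked to carry more than its capacity; this model and the function names are illustrative assumptions, not the simulator's mechanism.

```python
def max_generation_rate(link_capacity, fractions):
    """Highest sustainable traffic rate for one processor.

    fractions[i] is the share of the processor's traffic sent on output
    link i. The most heavily used link saturates first and sets the limit.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("traffic fractions must sum to 1")
    return link_capacity / max(fractions)

def deliverable_bandwidth(network_capacity, generation_rate):
    """Observed bandwidth is the minimum of the two limiting quantities."""
    return min(network_capacity, generation_rate)

# Assumed example: four output links of 100 Mbit/second each.
even = max_generation_rate(100.0, [0.25, 0.25, 0.25, 0.25])
uneven = max_generation_rate(100.0, [0.5, 0.25, 0.125, 0.125])
print(even, uneven)
print(deliverable_bandwidth(1000.0, even))
```

With an even split all four links can be kept busy, while the uneven split halves the sustainable rate because one link carries half of the traffic.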

In  the  artificial  traffic  load  program,  messages  from  each  task  are  uni¬ 
formly  distributed  among  all  other  tasks.  Thus,  this  program  is  a  "perfect 
match"  with  the  fully  connected  network  with  processor  and  communications 
integrated  onto  the  same  chip,  since  a  direct  link  exists  between  each  pair  of 
communicating  tasks.  Because  of  the  uniform  traffic  distribution,  all  links  are 
equally utilized, and the amount of traffic generated by the processors is maximized.
This rate determines the bandwidths shown in figure 3.8d.

Creating  a  new  network  by  adding  switching  components  increases  the 
amount  of  traffic  the  network  can  carry,  but  the  amount  of  traffic  which  can  be 
generated  is  not  increased.  Thus,  this  additional  network  bandwidth  cannot  be 
utilized.  In  fact,  performance  will  actually  be  degraded  if  the  new  network  does 
not  preserve  the  equal  utilization  of  processor  links  described  above.  This 
phenomenon explains the poor performance of some of the networks in figure
3.8d. 

The behavior described above is atypical because the global/uniform traffic
pattern of the artificial traffic load generator is not always appropriate. It will
be seen that other topologies and different traffic patterns yield results favoring
a  small  number  of  ports.  Indeed,  performance  curves  for  the  other  application 
programs  (figures  3.8a-b)  indicate  that  networks  using  communication  com¬ 
ponents  with  a  small  number  of  ports  achieve  better  performance  than  networks 
with  a  large  number  of  ports,  even  if  the  latter  have  the  added  advantage  of 
including  communication  circuitry  on  the  same  chip  as  the  processor.  Networks 
with  a  small  number  of  ports  and  processor  and  communications  on  the  same 
chip  will  perform  even  better,  widening  this  gap. 

The  curves  for  Barnwell's  algorithm  indicate  that  a  significant  performance 
improvement  results  from  incorporating  a  multicast  mechanism  in  the  com¬ 
munication  hardware.  If  no  multicast  mechanism  is  provided,  the  processor 
sending  the  message  must  send  a  separate  copy  to  each  destination.  A  queue 
appears  instantly  in  the  processor  sending  the  message,  leading  to  long  delays 
and  poor  performance. 

No  multicast  curve  is  shown  when  processor  and  communications  are  incor¬ 
porated  onto  the  same  chip.  This  is  because  networks  with  and  without  a  multi¬ 
cast  mechanism  behave  identically  under  these  circumstances.  Since  each  pro¬ 
cessor  has  a  direct  link  to  every  other  processor,  all  "splitting  apart"  of  the 
multicast  packet  is  done  at  the  source  node.  A  network  without  a  multicast 
mechanism  behaves  in  exactly  the  same  way  for  this  topology. 

3.6.2.  Full-Ring  Tree  Networks 

The  second  topology  is  the  full-ring  binary  tree  [Desp78].  This  topology  is 
constructed  from  a  binary  tree  by  adding  links  between  siblings  and  cousins,  as 
shown  in  figure  3.9.  The  average  hop  count  grows  logarithmically  with  the 
number  of  nodes,  while  the  number  of  branches  per  node  remains  fixed  at  5. 


Figure  3.9.  Full  ring  binary  tree. 

Figure 3.10.  Full ring tree (a) Barnwell.


Performance  curves  for  full-ring  tree  networks  are  shown  in  figures  3.10a-f. 
Qualitatively,  these  curves  agree  with  those  presented  for  the  fully  connected 
networks.  Again,  components  with  a  small  number  of  high-bandwidth  links 
achieve  the  best  performance. 

The  performance  curves  for  Barnwell’s  algorithm  (figure  3.10a)  indicate 
that as the I/O bandwidth of the communication components is increased,
speedup increases quickly at first, but becomes more gradual at higher link
bandwidths. Other curves, however, such as some of those for the FFT program
(figure 3.10d), indicate a linear increase in speedup. These differences arise
from  the  nature  of  the  communication  patterns  for  the  different  programs.  The 
linear  behavior  arises  when  one  virtual  circuit  remains  the  critical  path  for  the 
program  as  chip  bandwidth  is  varied.  As  bandwidth  is  increased,  delay,  and  thus 
execution  time,  decrease  in  proportion.  The  pipelined  programs  often  demon¬ 
strate  this  behavior,  with  the  longest  path  from  the  first  stage  of  the  pipeline  to 
the  last  forming  the  critical  path.  Many  of  the  SISO  programs  also  demonstrate 
this  behavior.  Here,  the  bottleneck  is  in  distributing  the  initial  data  samples  to 
the computation processors. The problem is aggravated if multiple-destination
messages are required to distribute the samples, as is the case in the Block I/O
(figure  3.10b)  and  Block  State  (figure  3.10c)  programs,  particularly  if  the  net¬ 
work  does  not  include  a  multicast  mechanism.  The  speed  of  the  links  around 
the  input  processor  becomes  the  primary  factor  which  determines  the  execu¬ 
tion  time  of  the  program.  Nonlinear  behavior  results  when  no  single  virtual  cir¬ 
cuit  dominates  performance  across  all  chip  bandwidths.  Instead,  delays  on  a 
number  of  virtual  circuits  determine  the  overall  execution  time.  Both  types  of 
behavior  will  be  seen  in  the  performance  curves  which  follow. 

The  curves  for  the  artificial  traffic  load  program  (figure  3.10f)  again  indi¬ 
cate  that  delay  and  bandwidth  are  both  improved  as  the  number  of  ports  is 
reduced. The curve with processor and communications integrated onto the
same  chip  indicates  that  the  link  between  the  processor  and  communication 
component  is  still  a  bottleneck,  since  the  bandwidth  provided  is  significantly 
better  than  that  provided  by  networks  using  components  with  4,  5,  or  6  ports 
per  node.  This  bandwidth  is  still  somewhat  less  than  that  of  the  Y-component 
network, however, even though the latter is handicapped by this bottleneck link.
This  adds  further  support  to  components  with  a  small  number  of  ports. 

3.6.3.  Butterfly  Networks 

The  32  node  butterfly  network  shown  in  figure  3.11  is  the  third  topology  stu¬ 
died.  The  butterfly  is  similar  to  the  tree  to  the  extent  that  the  average  hop 
count  grows  logarithmically  with  the  number  of  nodes.  Four  branches  are 
required for each node. This topology is more symmetric than the tree, however,
and  thus  is  less  susceptible  to  bottlenecks  for  applications  exhibiting  global 
traffic  patterns.  The  butterfly  is  ideally  suited  for  the  FFT  application  program. 

Performance  curves  for  the  butterfly  network  are  shown  in  figures  3.12a-e. 
Due  to  the  excessive  amount  of  computing  resources  required,  a  curve  for  the 
Block I/O program could not be produced. Networks constructed with components
using a small number of ports again achieve the best performance. The
FFT  program  (figure  3.12c)  performs  unusually  well  at  low  chip  bandwidths, 
demonstrating  the  reduction  in  bandwidth  requirements  when  a  good  mapping  is 
found  between  the  application  program  and  hardware.  The  curve  with  processor 
and  communications  on  the  same  chip  in  the  artificial  traffic  load  program 
(figure  3.12e)  indicates  that  the  link  between  the  processor  and  communication 
component  is  not  a  serious  bottleneck.  The  bandwidth  provided  is  equal  to  that 
of  the  network  using  communication  components  with  the  same  number  of 
ports. 

3.6.4.  Ring Networks

The  fourth  topology,  a  bidirectional  ring,  minimizes  the  number  of  branches 
per  node,  but  maximizes  the  average  hop  count.  Like  the  fully  connected  topol¬ 
ogy,  the  number  of  nodes  is  equal  to  the  number  of  tasks  in  the  application  pro¬ 
gram. 
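
The two extremes of the branch count versus hop count tradeoff can be compared with a short sketch. It assumes shortest path routing on a bidirectional ring and counts hops between processors only; the function names are hypothetical.

```python
def ring_average_hops(n):
    """Mean hop count from one node to every other node of an n-node ring.

    A node at clockwise offset d is reached in min(d, n - d) hops.
    """
    if n < 2:
        raise ValueError("need at least two nodes")
    return sum(min(d, n - d) for d in range(1, n)) / (n - 1)

def topology_summary(n):
    """Branches per node and average hop count for the two extreme topologies."""
    return {
        "ring": {"branches": 2, "average_hops": ring_average_hops(n)},
        "fully connected": {"branches": n - 1, "average_hops": 1.0},
    }

for n in (5, 8, 32):
    print(n, topology_summary(n))
```

The ring keeps two branches per node at any size, but its average hop count grows roughly as n/4, whereas the fully connected network holds the hop count at one by paying n - 1 branches per node.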

Performance curves for the ring network are shown in figures 3.13a-f. In
ring  topologies,  the  use  of  communication  components  also  implies  the  use  of 
Y-components,  since  only  three  ports  per  chip  are  required.  Thus,  only  two  dis¬ 
tinct  networks  need  to  be  compared.  The  network  with  communication  circuitry 
on  the  processor  chip  yields  better  performance  since  it  has  higher  bandwidth 
links  (only  2  ports  are  needed)  and  smaller  hop  counts.  Thus,  networks  con¬ 
structed  with  components  with  a  small  number  of  ports  again  achieve  the  best 
performance. 

The  only  exception  occurs  for  the  Block  I/O  program.  Here,  the  communi¬ 
cation  component  networks  achieve  better  performance  at  high  chip 
bandwidths. This behavior is a consequence of the SISO behavior of the program.
When execution begins, the "input" processor broadcasts data values to
a  number  of  computation  processors.  In  ring  networks  without  communication 
components,  this  causes  the  links  around  this  processor  to  become  saturated, 
blocking  traffic  produced  by  the  other  processors  since  messages  are  serviced 
at  each  node  by  a  strict  FIFO  ordering.  As  a  result,  the  signal  processing  calcu¬ 
lations  cannot  proceed  until  this  initial  backlog  of  traffic  is  cleared  up,  slowing 
down  the  computation.  If  communication  components  are  used,  the  link 
between  the  input  processor  and  its  communication  component  becomes 
saturated;  however,  this  link  does  not  block  traffic  among  other  processors.  The 
arrival  of  the  input  data  messages  at  the  communication  component  is  spread 
out  over  time.  Thus,  these  messages  do  not  completely  block  other  traffic, 
although they do increase congestion. In general, a priority mechanism could
be used to avoid this anomaly: traffic generated by the computation processors
can be assigned a higher priority than the input messages. At low bandwidths,
the performance of the communication component networks is limited by the
bandwidth of the link from the input processor to its communication component,
allowing networks with processor and communications on the same chip
to yield better speedup.

Figure 3.13.  Ring (c) Block State, (d) FFT.
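
The blocking effect of strict FIFO service, and the priority remedy suggested in the preceding paragraph, can be illustrated with a minimal sketch of a single link. It assumes that all messages are already queued at time zero and that service is non-preemptive; the message sizes and names are hypothetical.

```python
def finish_times(queue, prioritize_compute):
    """Serve queued messages on one link and report when each one finishes.

    queue is a list of (kind, service_time) pairs in arrival order. With
    prioritize_compute set, messages of kind "compute" are served before
    "input" messages; arrival order is preserved within each kind.
    """
    order = list(enumerate(queue))
    if prioritize_compute:
        order.sort(key=lambda entry: (entry[1][0] != "compute", entry[0]))
    clock = 0.0
    finished = []
    for _, (kind, service_time) in order:
        clock += service_time
        finished.append((kind, clock))
    return finished

# Assumed load: three long input messages arrive ahead of one short
# message produced by a computation processor.
queue = [("input", 4.0)] * 3 + [("compute", 1.0)]
print(finish_times(queue, prioritize_compute=False))
print(finish_times(queue, prioritize_compute=True))
```

Under FIFO the short computation message waits behind the whole input backlog, while under priority service it leaves first and the last input message still finishes at the same time.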

The  "blocked  traffic"  behavior  described  above  is  not  as  prominent  in  the 
other SISO programs, Barnwell and Block State. In Block State, the traffic within
the  pipeline  exhibits  enough  locality  that  it  can  avoid  the  congested  area  around 
the  input  processor.  In  Barnwell,  the  input  message  traffic  is  single  destination, 
in  contrast  to  Block  State  where  the  input  traffic  is  multiple  destination,  and 
thus does not create as much congestion. The "jump" in performance around
110  Mbit/chip  in  one  of  the  Barnwell  curves  is  caused  by  a  fortuitous  shift  in  the 
traffic  pattern  which  causes  an  unusual  reduction  in  queueing  delays  along  one 
of  the  links;  it  does  not  reflect  any  general  principles  of  behavior. 

3.6.5.  Conclusions  for  Cluster  Node  Networks 

The  simulation  results  for  cluster  node  networks  are  in  agreement  with  the 
analytical  results  presented  earlier.  Networks  constructed  from  components 
using  a  small  number  of  ports  yield  less  delay  than  networks  using  components 
with  many  ports.  Bandwidth  can  be  increased  by  adding  more  components  to 
the  communication  domain.  Eventually,  as  network  bandwidth  is  increased,  the 
rate  at  which  processors  can  generate  traffic  limits  performance,  rather  than 
the  bandwidth  of  the  network.  Also,  the  programs  using  multiple-destination 
communications  show  a  significant  performance  improvement  if  the  communi¬ 
cation  circuitry  includes  a  multicast  mechanism. 

3.7.  Simulation Results on Networks with a Fixed Number of Components

Cluster  nodes  constructed  from  components  with  a  large  number  of  ports 
require  fewer  components  than  those  constructed  with  a  small  number  of  ports. 
Thus,  the  studies  presented  above  do  not  consider  chip  count.  In  this  section, 
networks  using  the  same  number  of  components  are  considered.  The  applica¬ 
tion  programs  are  executed  on  lattice  and  tree  topology  networks  like  those 
analyzed earlier. It will be seen that bottlenecks form around the root of the
tree  networks,  biasing  the  results  to  favor  components  with  a  small  number  of 
high  bandwidth  links.  De  Bruijn  networks  are  examined  as  an  example  of  a  class 
of  network  topologies  with  logarithmic  average  hop  count,  but  without  this 
inherent  bottleneck. 

The  analytical  results  indicated  that  networks  constructed  from  com¬ 
ponents  with  a  small  number  of  ports  yielded  lower  delay,  but  less  bandwidth 
than  networks  using  components  with  a  large  number  of  ports.  Based  on  these 
results,  one  would  expect  networks  using  a  large  number  of  ports  to  yield  better 
performance  when  the  network  is  bandwidth  limited.  Intuitively,  as  we  move 
toward  networks  with  a  larger  number  of  (slower)  links,  the  average  hop  count  is 
reduced,  and  additional  paths  are  created  in  the  network.  These  trends  com¬ 
bine  to  reduce  traffic  on  congested  links.  If  the  reduction  in  congestion  is 
significant,  it  will  more  than  offset  the  disadvantage  of  using  slower  links,  and 
overall  performance  improves.  Of  course,  if  the  network  provides  adequate 
bandwidth  for  the  traffic  load  presented  to  it,  then  the  queueing  delays  will  be 
small, and networks using a larger number of ports can only achieve poorer performance
since link speed is reduced. Thus, networks with a large number of
ports  can  be  expected  to  provide  better  performance  when  the  traffic  load  is 
heavy  relative  to  total  network  bandwidth,  but  networks  with  a  small  number  of 
ports  can  be  expected  to  perform  better  otherwise. 

3.7.1.  Lattice Topologies

The application programs were run with switch models for the lattice topologies
shown in figure 3.14a-c. Performance curves are shown in figures 3.15a-f.
The FFT program exhibits better performance with a large number of ports, as
would be expected in bandwidth-limited networks. The remaining programs,
however, indicate little performance variation as the number of ports is varied,
or  better  performance  with  a  small  number  of  ports.  One  reason  for  this  is  that 
most  of  the  programs  encounter  bottlenecks  which  are  not  alleviated  when  the 
number  of  ports,  and  thus  the  number  of  paths  through  the  network,  is 
increased. In the SISO programs, for example, the bottleneck is around the
input  processor,  and  performance  is  determined  to  a  large  extent  by  the  speed 
of  the  communication  links  around  this  congested  area.  Since  components  with 
a  small  number  of  ports  use  faster  links,  they  achieve  better  performance. 


Figure 3.14.  Lattices (a) 3 ports.

The delay/bandwidth curves for the artificial traffic load program indicate
that  networks  with  a  small  number  of  ports  achieve  better  delay  and  bandwidth. 
The  bandwidth  result  disagrees  with  the  analytical  results  presented  earlier, 
which  suggested  that  reduced  hop  counts  would  allow  networks  using  com¬ 
ponents  with  a  large  number  of  ports  to  achieve  higher  throughput.  The  reason 
for  the  disagreement  is  that  for  high  traffic  loads,  the  processor/communication 
component link becomes a bottleneck. Performance is thus determined by the
speed  of  this  bottleneck  link. 

In  figures  3.16a-f,  this  bottleneck  is  removed  by  assuming  that  communica¬ 
tion  circuitry  is  integrated  onto  the  same  chip  as  the  processor.  The  curves  for 
the artificial traffic load program are in closer agreement with the analytic
results presented earlier; however, the results for the other application programs
are qualitatively the same. It is interesting to note that some of the SISO
programs, Block State and Block I/O in particular, experience lower performance
when this latter model is used. This anomalous effect can be attributed
to  the  "blocking  problem"  described  earlier  in  the  ring  topology  discussion. 
Messages  that  carry  the  input  samples  block  traffic  generated  by  the  computa¬ 
tion  processors. 

Finally,  the  convergence  of  some  of  the  multicast/non-multicast  curves  in 
the  Block  State  program  is  one  other  point  of  interest.  Recall  that  Block  State 
uses  multicast  to  distribute  the  initial  data  samples.  The  convergence  of  the 
curves indicates that beyond a certain bandwidth, here approximately 100
Mbit/chip, the network can provide computation processors with data samples as
quickly  as  they  can  be  processed,  so  improved  performance  along  these  virtual 
circuits  results  in  no  improvement  in  overall  execution  time.  In  addition,  the 
network  provides  enough  bandwidth  that  the  additional  traffic  caused  by  the 
absence  of  a  multicast  mechanism  does  not  degrade  performance. 


3.7.2.  Tree  Topologies 

Performance  curves  for  tree  networks  (figure  3.17  shows  one  such  network) 
are  shown  in  figures  3.18a-g.  For  all  application  programs,  it  is  seen  that  net¬ 
works  built  from  components  with  a  small  number  of  ports  yield  better  perfor¬ 
mance  than  those  using  a  larger  number  of  ports,  even  when  processor  and 
communications are incorporated onto the same chip (see figure 3.18g). These
results, however, are a consequence of congestion around the root node rather
than  from  the  hop  count/link  bandwidth  tradeoffs  discussed  earlier.  In  trees,  a 
disproportionate  amount  of  traffic  must  flow  through  the  root,  leading  to 
congestion  in  this  portion  of  the  network.  Increasing  the  number  of  links  does 
not  improve  the  amount  of  bandwidth  allocated  to  this  congested  area.  As  a 
result,  performance  is  determined  to  a  large  extent  by  the  speed  of  communica¬ 
tion  links  near  the  root.  Since  components  with  a  small  number  of  ports  have 
faster  links,  they  yield  higher  performance. 
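
The concentration of traffic near the root can be quantified with a short sketch. It assumes a complete binary tree with heap-style numbering and uniform traffic between all pairs of nodes; the topology of figure 3.17 may differ, and the function name is hypothetical.

```python
def link_loads(levels):
    """Number of unordered node pairs whose tree path uses each link.

    Nodes are numbered 1 .. 2**levels - 1 with node i having children 2i
    and 2i+1. A link (parent, child) separates the subtree below child
    from the rest of the tree, so it carries every pair split by that cut.
    """
    n = 2 ** levels - 1
    loads = {}
    for child in range(2, n + 1):
        depth = child.bit_length() - 1        # the root has depth 0
        below = 2 ** (levels - depth) - 1     # nodes in the subtree at child
        loads[(child // 2, child)] = below * (n - below)
    return loads

loads = link_loads(4)                         # 15-node tree, 105 node pairs
for link in ((1, 2), (2, 4), (4, 8)):
    print(link, loads[link])
```

In the 15-node case a link at the root carries 56 of the 105 pairs, against 36 one level down and 14 at a leaf, so the links nearest the root saturate first regardless of how many other links the network contains.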

3.7.3.  De Bruijn Networks

The  results  for  tree  topologies  were  biased  because  of  the  inherent 
bottleneck  around  the  root.  To  provide  a  true  test  of  the  analytical  results,  a 
class  of  topologies  is  required  which  does  not  have  this  inherent  bottleneck,  but 
which  also  has  an  average  hop  count  which  grows  logarithmically  with  the 
number  of  nodes.  The  class  of  topologies  must  be  general  to  the  extent  that  net¬ 
works  with  approximately  the  same  number  of  nodes  can  be  constructed  as  the 
number  of  ports  is  increased. 

One class of topologies which satisfies these requirements is the class of De
Bruijn networks [Brui48]. De Bruijn networks, which are only defined for even
degree (i.e. an even number of links per node), are the densest known infinite
family of undirected graphs of even degree greater than 4. A dense graph of
degree p is one with a small diameter. Diameter, which is specified as a function
of p and the number of nodes in the graph, is defined as the largest distance
between any pair of nodes, where distance refers to the length of the shortest
path between the two nodes. Until recently, De Bruijn graphs were not only the
densest family of graphs of degree greater than 3, but De Bruijn graphs of degree
p were also denser than any other family of graphs of degree p + 1. Recently,
however, denser graphs have been discovered for degrees 3, 4, 5 [Lela82a, Lela82b].
Also, Cs' graphs, which are only defined for odd degrees greater than 3, yield
smaller diameter than De Bruijn graphs with one fewer port per node [Farh81].
Still, the De Bruijn networks represent a set of graphs with logarithmic hop count
without the "root bottleneck" inherent in trees, and thus represent an attractive
topology for analyzing the optimum number of ports.

Figure 3.17.  Tree topology.

Figure 3.18.  Trees (a) Barnwell.

A De Bruijn graph is characterized by two parameters, a base $b$ and an
integer $n$. The graph consists of $b^n$ nodes. The address of each node is defined
by a string of digits, $x_0 x_1 \cdots x_{n-1}$, where $0 \le x_i < b$. The addresses
of nodes which are directly connected to $X$ are derived by shifting $X$'s address
left or right 1 digit, and shifting in a new digit $k$, $0 \le k < b$. Thus, node $X$
has links to nodes $y x_0 x_1 \cdots x_{n-2}$ and nodes $x_1 \cdots x_{n-1} y$, where
$y = 0, 1, \ldots, b-1$. Each node has up to $2 \times b$ links to other nodes. From
this definition, it is clear that node $X$ can reach any other node in at most $n$
hops, since an arbitrary address can be generated by shifting the $X$ address $n$
times. The topology does contain some degenerate cases. For example, with $b=2$,
nodes $00 \cdots 0$ and $11 \cdots 1$ have links to themselves, and nodes 0101...
and 1010... have more than one link between them. These are the only special
cases, however. The edges of the De Bruijn graph yield exactly the same
interconnection as the permutation network sometimes called the single-stage
shuffle-exchange [Ston71, Ston72]. A base 2, 8 node network is shown in figure 3.19.
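
The shift rule in the definition above can be sketched directly. Addresses are represented as tuples of digits, self-links are dropped, and duplicate links are merged by using sets; these representation choices and the function names are assumptions made for illustration.

```python
from collections import deque
from itertools import product

def de_bruijn_neighbors(address, base):
    """Nodes one hop from address: shift left or right and insert a new digit."""
    neighbors = set()
    for digit in range(base):
        neighbors.add(address[1:] + (digit,))    # shift left, digit enters on the right
        neighbors.add((digit,) + address[:-1])   # shift right, digit enters on the left
    neighbors.discard(address)                   # degenerate self-link, e.g. 00...0
    return neighbors

def diameter(base, n):
    """Largest shortest-path distance over all node pairs, found by BFS."""
    worst = 0
    for source in product(range(base), repeat=n):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in de_bruijn_neighbors(u, base):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

print(sorted(de_bruijn_neighbors((0, 0, 1), 2)))
for base, n in ((2, 5), (3, 3), (5, 2)):
    print(base, n, base ** n, diameter(base, n))
```

Node 010 in the base 2 case has only three distinct neighbors because both shifts can produce 101, which is the multiple-link degenerate case noted in the text.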


Figure 3.19.  Base 2 De Bruijn network.

For  this  study,  three  De  Bruijn  graphs  were  examined: 

(1)  $b=2$, $n=5$ (32 nodes)

(2)  $b=3$, $n=3$ (27 nodes)

(3)  $b=5$, $n=2$ (25 nodes).

These  graphs  were  selected  since  they  have  roughly  the  same  number  of  nodes, 
and  also  provide  enough  processors  to  execute  most  of  the  application  programs 
(the FFT is the only one requiring more than 25 processors). Communication
components for these graphs require 5, 7, and 11 ports for each node, respectively,
including one port to attach to the node’s computation processor, providing
a wide range in values for p.

The  performance  curves  for  the  De  Bruijn  networks  described  above  are 
shown in figures 3.20a-e. Performance with processor and communication circuitry
integrated onto the same chip is shown in figures 3.21a-e. The results are
qualitatively  similar  to  those  of  the  lattice  topologies.  The  curves  indicate  that 
better  performance  is  achieved  when  components  with  a  small  number  of  ports 
are  used. 

3.7.4.  Conclusions  for  Networks  with  a  Fixed  Number  of  Components 

The primary result of these simulation studies is that networks constructed
with components using a small number of ports achieve better performance
than those using a large number of ports. In some cases, this is in disagreement
with the results of analytical studies. This is normally due to bottlenecks that
prevent much of the bandwidth provided by the network from being utilized.
These bottlenecks can be alleviated by using components with a small number of
ports, since this provides maximum bandwidth for the links concerned. The
bottlenecks may arise from the application program (e.g. the SISO programs
here), or from the network topology (e.g. trees). The limited I/O bandwidth of
the processors generating messages may be the source of another bottleneck.
Finally, the simulations also demonstrate that significant performance
improvements can be achieved if mechanisms are included for efficient handling
of multiple-destination messages.

3.8. Influence of the Mapping of Tasks to Processors

The results described above assumed a specific algorithm, to be discussed
below, for mapping application programs onto the network topologies. Care
must be taken to ensure that this mapping algorithm is "equally good" for the
networks being compared, or else the differences between the curves may just
be a result of using a better mapping in one network relative to another. This
section addresses the question of how the quality of the mapping algorithm
affects the results presented above.

Figure 3.20. De Bruijn network (a) Barnwell, (b) Block I/O.
[Panel: Random, De Bruijn; axis: delay (microseconds)]

The  simulation  results  used  two  types  of  switch  models.  The  first  is  based 
on the cluster node. However, cluster node networks are only an implementation
of a given topology. Thus, within each topology, identical mappings are
used,  and  no  cluster  node  network  is  favored  over  another. 

The second type of switch model compares networks with different
topologies. Performance curves for different types of lattices, trees, and De
Bruijn networks are compared. In order to characterize the "goodness" of a
mapping, a quality measure must be established. Changing the number of ports
affects link speed and average hop count. Since the mapping algorithm has no
impact on link speed, but does affect average hop count, the latter is an
appropriate measure. In particular, as the number of ports is increased, the
average hop count should decrease in a manner similar to that observed in the
analytical studies. If it can be shown for each application program that the
average hop count decreases "as it should", then it can be concluded that the
mapping algorithm does not bias the results to favor (say) lattices with a small
number of ports. On the other hand, if increasing the number of ports creates an
unexpectedly small (large) improvement in average hop count, then a better
mapping was done on the network with a small (large) number of ports, weakening
(strengthening) the conclusion that a small number of ports is better.

Average hop count is defined for an application program a as:

$$\overline{H}_a = \frac{1}{\gamma} \sum_{i} \sum_{j} \gamma_{ij} \, d_{ij}$$

where $d_{ij}$ is the number of links traversed in the shortest path from i to j,
$\gamma$ is the total number of messages sent into the network, and $\gamma_{ij}$
is the total number of messages sent from i to j. This definition differs from
that presented earlier because the earlier definition assumed a uniform traffic
distribution, implying all paths have equal weight. Values for $\overline{H}_a$
are given in table 3.2 for the different application programs.
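
The weighted average described above, in which each source-destination hop distance is weighted by the number of messages sent on that pair, can be sketched as follows. This is a minimal illustration only: the ring network, the traffic counts, and the function names are hypothetical and do not come from the simulations reported here.

```python
from collections import deque

def shortest_hops(adjacency, src):
    """Breadth-first search: hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_hop_count(adjacency, traffic):
    """traffic maps (i, j) -> number of messages sent from i to j."""
    total_messages = sum(traffic.values())
    weighted_hops = 0
    for (i, j), count in traffic.items():
        weighted_hops += count * shortest_hops(adjacency, i)[j]
    return weighted_hops / total_messages

# Hypothetical 4-node ring 0-1-2-3-0 with made-up message counts.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
traffic = {(0, 1): 30, (0, 2): 10, (3, 1): 10}
print(average_hop_count(ring, traffic))  # (30*1 + 10*2 + 10*2) / 50 = 1.4
```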


Table 3.2.
Average Hop Count

l.b. (l) = lower bound from weighted average hop count, optimal packing

ant. (g) = anticipated from 20 node network, global communications

Two mapping schemes were used in the application programs exhibiting
local communications (Block I/O, Block State, and FFT). The first is the original
mapping whose results were described in section 3.7. This mapping was
designed to minimize the average hop count on virtual circuits carrying the
initial data samples in the SISO programs. If the processors of the network are
envisioned as being uniformly distributed across the surface of a disk, the
processor distributing the data samples resides in the center, and the
processors receiving these samples are packed around it.

The second mapping attempts to optimize the virtual circuits carrying the
local, i.e. pipelined, communication traffic among the computation tasks. This
mapping was performed for lattice topologies only. Figures 3.22a-e indicate
which tasks are assigned to which processors for this latter mapping. The FFT
program uses the same task number assignments that are shown in figure 3.6.
Figures 3.23a-f are the performance curves for these mappings for lattices with
communication components and lattices with processor and communications
integrated onto the same chip. Except for the FFT program, which will be
discussed later, the results are qualitatively the same as those found in the
original mapping.

Also included in table 3.2 are "anticipated" values for $\overline{H}_a$. For
programs using global communications (Barnwell, LU, and Random), this
anticipated value is computed by examining the average hop count in a 20 node
network, assuming a uniform traffic distribution. For programs exhibiting local
communications (Block I/O, Block State, and FFT), however, this is clearly an
unrealistic measure. Here, a lower bound value for anticipated hop count is used.
Suppose the application program requires that processor X must send messages
to k other processors. The minimum hop count from X to these processors is
obtained by assuming that the k processors communicated with are those which
are closest to X. By repeating this computation for each processor, a lower
bound on $\overline{H}_a$ is obtained. This is the traffic distribution assumed in
the analytical model presented earlier (networks with a fixed number of
components), except the uniform traffic distribution assumption has been
relaxed. In table 3.2, no anticipated value is listed for the FFT program since
each task communicates with only 2 other tasks, leading to
$\overline{H}_a = 1$ for all topologies.

Figure 3.22. Second Mapping (c) 6 ports, (d) Block I/O program.

Figure 3.22. Second Mapping (e) Block State program.

Figure 3.23. Lattices (2nd mapping, P+C) (f) FFT.
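
The lower-bound computation described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: every destination is assumed to receive the same number of messages (the text refers to a weighted average), and the 6-node ring and all names are hypothetical.

```python
from collections import deque

def shortest_hops(adjacency, src):
    """Breadth-first search: hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def lower_bound_hop_count(adjacency, fanout):
    """fanout maps processor X -> k, the number of processors X sends to.

    Each X is assumed to talk to its k closest processors, with equal
    message counts per destination (an illustrative simplification).
    """
    total_hops = 0
    total_destinations = 0
    for x, k in fanout.items():
        dist = shortest_hops(adjacency, x)
        nearest = sorted(d for node, d in dist.items() if node != x)[:k]
        total_hops += sum(nearest)
        total_destinations += k
    return total_hops / total_destinations

# Hypothetical 6-node ring in which every processor sends to k others.
ring6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(lower_bound_hop_count(ring6, {i: 3 for i in range(6)}))  # (1+1+2)/3
print(lower_bound_hop_count(ring6, {i: 2 for i in range(6)}))  # 1.0
```

With k = 2 on a network where every node has at least two neighbors, the bound is 1, which is consistent with the remark about the FFT program.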

Table 3.2 also includes fractional improvements in average hop count
relative to the network using the minimum number of ports (p = 3 for lattices,
p = 4 for De Bruijn) as the number of ports per chip is increased. In comparing
the experimentally measured results with anticipated and lower-bound hop
counts, it is seen that in most cases the experimental results are comparable to
or surpass the anticipated improvement. This implies that any bias introduced by
differences in the mapping algorithms makes a large number of ports appear in
a better light than they should, thus strengthening the conclusion that a small
number of ports is better.

Finally, let us consider the impact of using a better mapping algorithm on
the results derived so far. As a closer match is found between the
communication structure of the application program and the network topology,
communications become more localized. Processors communicate less with
processors far away, so the reduction in average path length, which occurs when
the number of ports is increased, becomes less effective. The strength of the
argument for a large number of ports relies on a reduction in network congestion.
However, as the mapping is improved, congestion becomes less significant.
Increasing the number of ports then only decreases the bandwidth available to
each virtual circuit, and thus degrades performance. Thus, the fact that the
simulation results may not use an "optimal" mapping of tasks to processors can
only bias the results to favor a large number of ports, strengthening the
conclusion that a small number of ports is better. Some support for this
conclusion is seen in figure 3.23f, where an optimized mapping for the FFT
program yields better performance with a small number of ports, while the
original mapping yielded better performance with a large number of ports.
Similarly, as shown in figure 3.12c, the execution of the FFT on the butterfly
topology, an optimal mapping, yields better performance when a small number of
ports is used.

3.9. Precision of the Simulation Results

A certain amount of uncertainty exists in all of the simulations presented
thus far. The arrival times of messages at each node are a function of the
queueing delays encountered in previous nodes, which in turn depend on the
arrival times of other messages. These complex interactions lead to message
delays which vary according to the times at which messages are generated. In
general, the application programs executed on the multicomputer are not known
a priori, so some uncertainty exists in the times at which messages are generated
by the application program, leading to uncertainty in message delays. If this
uncertainty is large, the results presented thus far could be based on chance
behavior rather than the tradeoffs described earlier. The amount of this
uncertainty will now be examined.

The fact that the programs generated a relatively large number of
messages (typically, thousands) combined with the large quantity of curves
producing the same qualitative results leads one to suspect that the conclusions
presented thus far are not simply the result of random fluctuations.
Furthermore, the fact that the curves yielded results which were consistent with
those predicted by the analytical models (or in disagreement in explainable ways)
strengthens this belief. Nevertheless, in order to obtain a quantitative measure
of the uncertainty described above, simulations were repeated with fluctuations
introduced in the times at which messages are generated.
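
A minimal sketch of one way to introduce such fluctuations, assuming exponentially distributed interarrival times drawn from a seeded pseudo-random generator. The function names, the mean interarrival time, and the spread measure are illustrative assumptions, not the simulator's actual interface.

```python
import random

def message_generation_times(seed, mean_interarrival, count):
    """Cumulative generation times with exponential interarrival gaps."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(count):
        t += rng.expovariate(1.0 / mean_interarrival)
        times.append(t)
    return times

def relative_spread(values):
    """(max - min) divided by the mean: a crude measure of run-to-run variation."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean

# Repeat the "experiment" with five different seeds and compare the time
# at which the last of 1000 messages is generated.
finish_times = [message_generation_times(seed, 10.0, 1000)[-1] for seed in range(5)]
print(relative_spread(finish_times))
```

Each seed yields a different but reproducible set of generation times, so any performance measure computed from the runs can be compared across seeds to estimate its uncertainty.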

The artificial traffic generator program was executed on the hexagonal
lattice network of figure 3.14a for the case of processor and communication
circuitry integrated onto the same chip. This network was chosen because it does
not exhibit any of the bottlenecks described earlier, e.g. the root bottleneck in
the tree network, which might mask the effect of the random fluctuations.
Interarrival times are again selected from an exponential distribution. The
results of these experiments are shown in figure 3.24. Each curve represents
performance for a different set of message arrival times. Fluctuations are
introduced by using different seeds in the random number generator which
determines interarrival times. As shown in figure 3.24, the delay in the lightly
loaded network varies by up to 1.7%, while network throughput varies by up to
8.8%. Uncertainty in delay does increase significantly as the network approaches
saturation; however, this does not affect the conclusions derived above since
they were based on the previous two performance measures.

Figure 3.24. Precision of simulations.


Message delay uncertainty leads to uncertainty in the execution time of the
programs. This latter quantity is the performance measure used in the bulk of
the simulations presented here. The percentage of uncertainty in execution
time will be less than that corresponding to message delay, however, because
message delay only affects one component of the overall execution time.
Execution time is composed of two components: the time spent executing
instructions and the time spent waiting for data. Message delays only affect the
latter component. Although the absolute magnitude of fluctuations may be the
same for both message delay and execution time, the percentage of the
uncertainty will be smaller in the latter, since execution time is always the
larger quantity. Thus, the uncertainty in message delay described above will
lead to an even smaller uncertainty in execution time.

3.10. Summary of Simulation Studies

In most cases, the simulation results support the analytical results
discussed earlier. When discrepancies do occur, they favor networks using a small
number of ports. It was seen that bottlenecks are the source of these
discrepancies. Thus, the simulation results support the conclusion found earlier
that components with a small number of ports are better. These results also
demonstrate the utility of incorporating efficient mechanisms for handling
multiple-destination messages. Such a mechanism can yield significant
performance improvement in algorithms relying heavily on global information.

CHAPTER FOUR

DESIGN  AND  IMPLEMENTATION  OF 
COMMUNICATION  COMPONENTS 


This chapter examines the amount of circuitry required to implement a
VLSI communication component. Alternative mechanisms for transporting data
through communication networks are first compared, and a virtual-circuit
transport mechanism is argued to be the most attractive alternative for the
networks discussed here. Details of such a transport mechanism are then
described. Next, alternative schemes for providing hardware support for three
key communication functions (routing, buffer management, and flow control) are
described. Practical figures for the number of channels and buffers within each
component are derived and used as the basis for estimates of the complexity of
such a component. It will be seen that the functional capabilities of VLSI chips
are now sufficient to allow the construction of communication components with
enough buffer space and virtual channels to provide high-bandwidth
communications in multicomputer networks.

4.1.  Transport  Mechanisms 

Since the primary function of the communication network is to move data,
a transport mechanism, i.e. the means by which data is transmitted through the
network, must be selected. A classification tree which includes the various
transport mechanisms in use today is shown in figure 4.1. A number of
characteristics which distinguish these transport mechanisms are also shown.
Briefly, these characteristics are:

(1) Data Unit: The unit of data transported through the network is either a
variable-length message or a fixed-length packet.

(2) Routing Overhead: The overhead associated with message routing is
incurred either on a hop-by-hop basis at each node in the network or only in
the initial set-up of a circuit.

(3) Bandwidth Allocation: Bandwidth is allocated by the network either
statically, e.g. when a circuit is set up, or dynamically as messages enter the
network.

(4) Buffering Complexity: The complexity of the buffering hardware varies with
the sophistication of the chosen transport mechanism.

Figure 4.1. Transport Mechanisms.


According to the tree in figure 4.1, the first alternative to consider in
selecting a transport mechanism is between a circuit-switched and a
store-and-forward approach. The circuit-switched approach is best exemplified
(at least conceptually) by the telephone system: when someone picks up a
telephone and dials a phone number, a circuit is established between the caller
and the party being called. Once this circuit is established, it remains intact
until either party hangs up. Examples of circuit-switched networks are described
in [Joel79, Mass79]. The most distinguishing characteristic of this approach is
the fact that communicating parties are guaranteed a certain bandwidth and
maximum latency when the call is established. Since the communication network
cannot know when data will be transmitted, bandwidth must be allocated
statically when the call is set up. Otherwise, the bandwidth may not be available
when it is needed. Users are allocated a certain amount of bandwidth regardless
of whether or not they actually use it. If communications are bursty, as is often
the case in computer networks, much of the network's bandwidth will be wasted.
This is the primary disadvantage of the circuit-switched approach.

However, the circuit-switched approach also offers a number of advantages.
Routing overhead is usually paid only when the circuit is set up, so subsequent
messages can flow through the network with little delay. This reduces the
average delay on circuits carrying more than one message. Also, buffering
strategies are simpler than those required for store-and-forward networks
because bandwidth allocation is performed statically. If circuit-switching is
used, the network can be designed to ensure that the rate of traffic flow into
each node of the network never exceeds the rate of flow out, alleviating buffer
overflow problems. In fact, if each circuit is implemented as a physical
electrical connection between the communicating parties (e.g. a series of
relays), the network need not provide any buffering at all!


Store-and-forward networks avoid the wasted bandwidth problem described
above by allocating bandwidth dynamically to messages as they enter the
network. Three types of store-and-forward networks have been implemented in
the past:

(1)  Datagram  networks. 

(2)  Packet  switched  networks. 

(3)  Virtual  circuit  networks. 

Datagram networks are characterized by the unit of data sent through the
network: variable-length messages. Since each message can be relatively
large, communication components would have to provide a relatively large
amount of buffer space to hold arriving messages. This implies that a large
amount of circuitry in each component must be devoted to message buffers. In
addition, since messages may vary in length, variable-size buffers must be used.
This increases the complexity of the buffer management circuitry significantly,
since the buffer selected for a particular message must be at least as large as
the message. This problem is analogous to the difference between virtual
memory systems based on segments, whose complexity usually requires a
software implementation, and those based on pages, which are usually
implemented, at least to a large extent, in hardware. Finally, routing overhead
in the datagram approach is worse than that of the circuit-switched approach
because routing decisions must be made on a hop-by-hop basis with each message
sent into the network.

The packet switched transport mechanism alleviates many of the buffering
problems described above. Here, each message is divided into a number of
(usually) fixed-sized packets which are routed separately through the network.
An end-to-end scheme is required to reassemble the message from its
constituent packets. Since packets can be relatively small, buffering
requirements in each component are reduced. The use of fixed-sized packets also
simplifies the buffer management circuitry. This approach does, however, incur a
significant amount of overhead on an end-to-end level to reassemble messages.
Since packets are routed separately through the network, they may follow
different routes to the destination node, and therefore may arrive in an order
different from that in which they were sent. The other disadvantage of the
packet switched approach is that the routing overhead problem is worse than
that of the datagram scheme, since this overhead now occurs on every packet
rather than on every message.
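
The splitting and end-to-end reassembly described above can be sketched as follows. This is an illustrative sketch only: the sequence-number scheme, the zero padding of the last packet, and the function names are assumptions, since the text does not specify a reassembly protocol.

```python
import random

def packetize(message: bytes, packet_size: int):
    """Split a message into (sequence_number, payload) packets of fixed size.

    The last packet is zero-padded, so the original length is returned
    separately for use at the destination (an assumed convention).
    """
    packets = []
    for seq, start in enumerate(range(0, len(message), packet_size)):
        payload = message[start:start + packet_size].ljust(packet_size, b"\0")
        packets.append((seq, payload))
    return packets, len(message)

def reassemble(packets, length):
    """Restore the original order by sequence number and strip the padding."""
    ordered = sorted(packets, key=lambda p: p[0])
    return b"".join(payload for _, payload in ordered)[:length]

message = b"variable-length message routed as fixed-size packets"
packets, length = packetize(message, 8)

# Packets routed separately may arrive out of order; simulate with a shuffle.
random.Random(1).shuffle(packets)
assert reassemble(packets, length) == message
```

The sort at the destination is the end-to-end reassembly cost referred to above; a fixed path that preserves packet order would make it unnecessary.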

If we examine the transport mechanisms described thus far, we see that the
circuit-switched approach suffers from static bandwidth allocation, while the
packet switched approach suffers from reassembly and routing overhead. One
might hope that a hybrid which combines these two approaches can achieve the
best of both mechanisms without their respective disadvantages. This is the
motivation behind the virtual circuit transport mechanism, which is a mixture
of packet switched and circuit-switched techniques. Here, a virtual circuit is
established between processors which wish to communicate. A virtual circuit is
a fixed, unidirectional path through the network from one processor to another.
All messages sent on this circuit travel along this path to reach their
destination.

Let us consider the characteristics of the virtual circuit transport
mechanism (listed in figure 4.1). Like the packet switched mechanism, the data
unit is a fixed-sized packet (simplifying the buffering problems of the datagram
mechanism). Routing is similar to the circuit-switched approach to the extent
that the routing algorithm need only be applied when the circuit is set up, and
not with subsequent packets. It will be seen, however, that some overhead is
still required to route messages, so the routing overhead is intermediate between
the circuit-switched and the datagram/packet switched approaches. Since a
store-and-forward mechanism is used, network bandwidth is allocated
dynamically, although allocation is not as adaptive as it is in packet switched
networks because packets are constrained to follow a fixed path from source to
destination. The fixed path restriction in the virtual circuit mechanism is
necessary to reduce routing overhead and to avoid reassembly overhead. Thus,
while packet switched networks may be able to achieve higher bandwidth along an
end-to-end connection by utilizing multiple paths between the two processors,
the virtual circuit scheme will yield lower latency on individual messages since
they spend less time in each node waiting for routing decisions to be made. In
addition, the virtual circuit mechanism can utilize multiple paths between two
nodes by establishing several circuits between the two processors.

Thus, a virtual circuit transport mechanism appears to be the most
attractive for the networks described here. Details of the operation of this
mechanism
are  described  in  the  next  section.  Hardware  implementations  are  described  in 
the  sections  which  follow. 

4.2.  A  Virtual  Circuit  Based  Communication  System 

The  communication  domain  studied  here  is  a  packet-based  network  using  a 
virtual  circuit  transport  mechanism.  Mechanisms  for  establishing,  maintaining, 
and  tearing  down  circuits  are  described  in  this  section. 

4.2.1.  Virtual  Circuits 

Each processor has a fixed number of input and output circuits for
receiving and sending data respectively. Sending a message is a three step
process. First, a virtual circuit (i.e. a path of time-multiplexed links) to the
destination processor is established by sending a message header with routing
information through the network. Once a circuit is set up, an arbitrary amount
of data, which may consist of several logical messages, can be sent along this
circuit. Data can follow the message header immediately without an end-to-end
handshake and need not be transmitted continuously for the circuit to remain
intact. This approach reduces the routing overhead on all packets except the
message header. When the circuit is no longer needed, it is torn down by
sending a tagged message trailer.

The communications system provides only a data transport facility. Except
for the header and trailer information, all data passes uninterpreted through
intermediate nodes. Error checking and retransmission are left to an
end-to-end protocol. This allows the forwarding of data packets in each node to
begin before the entire packet has arrived if the proper outgoing link is idle
(virtual cut-through [Kerm79]). If error checking and retransmissions were
performed within the network on a hop-by-hop basis, forwarding could not begin
until the entire packet had arrived and been checked for errors, since otherwise
an erroneous packet would have been forwarded by the time the error was
detected. This end-to-end approach is justified by the low error rates observed
in local computer networks [Shoc80]. Since the networks discussed here cover
an even smaller geographic area, and thus are less susceptible to environmental
noise, this assumption is even more appropriate.
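
The latency benefit of forwarding a packet before it has fully arrived can be illustrated with a simple model. The sketch below assumes idle links, no queueing or processing delay, and that a node can begin forwarding as soon as the packet header has been received. The function names and the numbers in the example are hypothetical.

```python
def store_and_forward_latency(hops, packet_bits, link_bps):
    """Each node receives the whole packet before sending it on the next link."""
    return hops * (packet_bits / link_bps)

def cut_through_latency(hops, packet_bits, header_bits, link_bps):
    """Each intermediate node waits only for the header; the rest is pipelined.

    packet_bits includes the header. With a single hop there is no
    intermediate node, so both models give the same result.
    """
    intermediate_nodes = hops - 1
    return intermediate_nodes * (header_bits / link_bps) + packet_bits / link_bps

# Hypothetical figures: 4 hops, 256-bit packets, 16-bit headers, 1 Mbit/s links.
print(store_and_forward_latency(4, 256, 1e6))
print(cut_through_latency(4, 256, 16, 1e6))
```

Under these assumptions the per-hop cost drops from a full packet time to a header time, which is the advantage that hop-by-hop error checking would forfeit.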

4.2.2.  Virtual  Channels 

The  communications  domain  can  be  viewed  as  a  simple,  connected  graph. 
Nodes  and  edges  represent  communications  components  and  links,  respectively. 
A  circuit  from  one  processor  to  another  corresponds  to  a  path  in  this  graph. 
Two  distinct  paths  (say  from  node  A  to  B  and  from  C  to  D  in  figure  4.2)  may  use  a 
common  edge  (from  X  to  Y).  Thus,  the  link  associated  with  that  edge  must  be 
multiplexed  between  the  two  paths,  and  provisions  must  be  made  to  ensure  that 
data from A is sent to B, and not to D.

Figure 4.2. Two Paths Multiplexed through the Same Link.

Each physical link is divided into some fixed number of unidirectional
virtual channels. Each channel can carry data for one virtual circuit (i.e. one
path). Thus, a circuit from one node to another consists of a sequence of
channels on the links in the path between the two nodes. The circuit from A to B
in figure 4.2, for example, might use channel #3 to get to X, then #5 to get to Y,
and finally #7 to get to B.

When node X sends data to node Y, the latter must determine which circuit
this data belongs to. Two commonly used techniques for providing this
information are:

(1) Divide the link into a fixed number of time slots and statically assign each
time slot to a channel (e.g. the first time slot might be assigned to channel
#0, the second to channel #1, etc.). The time slot on which the data arrives
identifies the channel that sent it.

(2) Precede the data with a tag that identifies the channel it is being sent on.
In this scheme, the available bandwidth on the link is allocated to the
various channels by some demand-driven scheduling algorithm.

In the first scheme, the link is effectively divided into a number of lower
bandwidth links, with the sum of these bandwidths equal to that of the physical
link. If a channel does not send any data, its allocated bandwidth is wasted. In
addition, latency is increased since each channel must wait for its turn to send a
unit of data. In the second scheme, the entire bandwidth of the link can be
allocated to channels upon demand, i.e. when they have data to send, so the
inefficiencies associated with the previous approach are avoided. However,
some bandwidth is required to carry the channel tag. Demand-driven
time-multiplexing is superior if the degree of multiplexing on each link is high,
and many channels do not always have data to send. This is often the case in
computer-to-computer communications, so the dynamic approach is more
suitable for the networks described here.
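
The tradeoff between the two schemes can be made concrete with a rough model. The sketch below measures time in "slots" (the time to send one untagged unit of data) and assumes that the demand-driven scheduler shares the link fairly among the channels that currently have data. The function names and the channel counts, data width, and tag width are hypothetical.

```python
def static_slots_to_send(units, num_channels):
    """Static time slots: a channel gets one slot per frame of num_channels
    slots, whether or not the other channels have anything to send."""
    return units * num_channels

def tagged_slots_to_send(units, active_channels, data_bits, tag_bits):
    """Demand-driven: only active channels share the link (assumed fair
    sharing), but every unit is lengthened by its channel tag."""
    overhead = (data_bits + tag_bits) / data_bits
    return units * active_channels * overhead

# Hypothetical link: 16 channels, 8-bit data units, 4-bit channel tags.
print(static_slots_to_send(100, 16))        # 1600 slots regardless of load
print(tagged_slots_to_send(100, 2, 8, 4))   # 300.0 slots when 2 channels are active
print(tagged_slots_to_send(100, 16, 8, 4))  # 2400.0 slots when all 16 are active
```

In this model the tagged scheme wins whenever few of the channels are active at once, and loses only when nearly all channels are busy and the tag overhead dominates.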

4.2.3.  Routing  Hardware 

In order to route messages through each node of the network, channels
entering a node (input channels) must be "linked" to channels leaving the node
(output channels). Each node maintains a set of translation tables to perform
this function. There is one translation table for each input port of a node. Each
entry of the translation table contains two fields: an output port, and the
number of a channel on that port. When data arrives on an input channel, say
channel #3, entry 3 of the translation table for that port is read to yield the
output port and the number of the channel the data is to be forwarded on.

The translation tables logically link incoming and outgoing channels, and
thus establish the various virtual circuits through the node. Setting up these
circuits involves allocating channels and updating translation tables along each
path from source to destination. This task is performed by a routing controller
residing in each communications node.

Initially,  all  translation  tables  specify  that  data  is  to  be  sent  to  the  local 
routing  controller.  When  a  message  header  setting  up  a  new  circuit  arrives  at  a 
node,  the  routing  controller  analyzes  the  destination  address  in  the  header, 
determines  the  proper  output  port  with  the  use  of  some  routing  algorithm,  allo¬ 
cates  a  free  output  channel,  and  updates  the  translation  table  at  the  input  port. 
Measurements  on  a  TTL  prototype  of  such  a  routing  controller  [Fuji80]  show  that 
this entire operation can be done in 4-5 μsec if a free channel is available on
the  selected  link.  Subsequent  data  is  then  forwarded  without  intervention  by 
the  routing  controller.  Similarly,  when  the  circuit  is  torn  down,  the  channel  is 
released,  and  the  corresponding  translation  table  entry  is  reset  to  point  to  the 
routing  controller. 
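A minimal sketch of the set-up and tear-down bookkeeping follows. The `Node` class, the `CONTROLLER` marker, and the choice of the lowest-numbered free channel are assumptions for illustration; the output port is taken as already chosen by some routing algorithm.

```python
CONTROLLER = ("controller", None)  # hypothetical marker: "send to local routing controller"

class Node:
    def __init__(self, num_ports, num_channels):
        # Initially every translation table entry points at the routing controller.
        self.table = [[CONTROLLER] * num_channels for _ in range(num_ports)]
        # Free output channels, tracked per output port.
        self.free = [set(range(num_channels)) for _ in range(num_ports)]

    def set_up(self, in_port, in_channel, out_port):
        """Allocate a free channel on out_port and link the input channel to it."""
        if not self.free[out_port]:
            return None  # no free channel on the selected link
        out_channel = min(self.free[out_port])  # allocation order is an assumption
        self.free[out_port].remove(out_channel)
        self.table[in_port][in_channel] = (out_port, out_channel)
        return out_channel

    def tear_down(self, in_port, in_channel):
        """Release the output channel and point the entry back at the controller."""
        entry = self.table[in_port][in_channel]
        if entry == CONTROLLER:
            return  # no circuit was set up on this channel
        out_port, out_channel = entry
        self.free[out_port].add(out_channel)
        self.table[in_port][in_channel] = CONTROLLER
```

Once `set_up` has run, forwarding needs only a table read, which is why subsequent data can bypass the routing controller.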

4.2.4.  Packet  Types  and  Formats 

Three  types  of  packets  have  been  discussed  thus  far:  a  "set-up  packet" 
which  establishes  virtual  circuits,  a  "trailer  packet"  which  tears  them  down,  and 
a  "data  packet"  which  carries  data.  In  addition,  it  is  useful  to  provide  a  "clear 
packet"  which  flows  through  a  virtual  circuit,  removing  any  data  packets  it 
encounters  along  the  way.  Such  a  mechanism  is  useful  in  error  recovery  proto¬ 
cols  to  reset  virtual  circuits  to  a  "known"  state. 

Packets must be tagged to distinguish the various types. Each packet is
preceded by a header which indicates the packet's type, as well as a channel
number indicating which virtual circuit the packet belongs to. Set-up packets
also carry routing information (e.g. a destination address) which the routing
controller uses to set up the circuit. Assuming two bits to indicate type, one bit
for parity, and a one-byte header on each packet (excluding routing information
on set-up packets), 5 bits remain for a virtual channel number. This implies that
a maximum of 32 channels can be supported on each link. Later, it is argued that
under current technology, a larger number of channels should be supported, say
64 or 128. Since it is convenient to restrict the header information to an
integral number of bytes, a two-byte header could be used to support this many
channels.
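The one-byte fixed header can be illustrated as follows. The text fixes only the field widths (2 type bits, 1 parity bit, 5 channel bits); the bit positions, the type codes, and the use of even parity are assumptions made for this sketch.

```python
# 2-bit packet type codes (the particular assignment is assumed)
SET_UP, DATA, TRAILER, CLEAR = range(4)

def parity(x: int) -> int:
    """Return 1 if x has an odd number of 1 bits, else 0."""
    return bin(x).count("1") & 1

def pack_header(ptype: int, channel: int) -> int:
    """Build a one-byte header: type (2 bits), channel (5 bits), parity (1 bit)."""
    assert 0 <= ptype < 4 and 0 <= channel < 32
    body = (ptype << 5) | channel        # 7 bits of payload
    return (body << 1) | parity(body)    # parity bit makes the byte even-parity

def unpack_header(byte: int):
    """Return (type, channel), or raise ValueError on a parity error."""
    if parity(byte) != 0:
        raise ValueError("header parity error")
    body = byte >> 1
    return body >> 5, body & 0x1F
```

The 5-bit channel field is what limits each link to 32 channels in this format.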

An  alternative  approach  to  the  fixed  length  header  scheme  described  above 
is  to  use  variable  length  packet  headers.  For  example,  assuming  that  most  of 
the  packets  flowing  through  the  network  are  data  packets,  we  could  confine  the 
overhead  in  these  packets  to  a  single  byte,  while  forcing  other  packet  types  to 
use  several  bytes.  Under  this  scheme,  the  header  of  each  data  packet  consists 
of  a  7-bit  channel  number  and  a  single  bit  for  parity.  One  channel  number,  say 
#0, is declared to be "undefined". When a packet header specifies this channel
number,  it  indicates  that  the  packet  is  not  a  data  packet,  but  rather  some  other 
type.  Subsequent  bytes  indicate  the  type  of  packet,  and  any  type-specific  infor¬ 
mation.  This  approach  reduces  the  overhead  required  on  data  packets,  and  thus 
provides  better  performance  in  transmitting  these  packets  than  the  fixed-length 
header  scheme  described  above.  Although  the  amount  of  time  required  to  pro¬ 
cess  the  other  packet  types,  the  set-up  packet  in  particular,  is  slightly 
increased, delays on these other packet types are less crucial. The assumption
that  data  packets  use  a  single  byte  for  header  information  was  used  in  much  of 
the  analysis  presented  earlier. 
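A sketch of the variable-length scheme is given below. The first byte follows the text (7-bit channel number plus one parity bit, with one channel number reserved as an escape). The layout of the bytes that follow the escape, here one type byte and one channel byte, is an assumption, as are the bit positions and even parity.

```python
ESCAPE = 0  # reserved "undefined" channel number: the packet is not a data packet

def parity(x: int) -> int:
    return bin(x).count("1") & 1

def data_header(channel: int) -> bytes:
    """One-byte header for a data packet: 7-bit channel + parity bit."""
    assert 1 <= channel < 128
    return bytes([(channel << 1) | parity(channel)])

def control_header(ptype: int, channel: int) -> bytes:
    """Escape byte, then a type byte and a channel byte (assumed layout)."""
    return bytes([ESCAPE << 1, ptype, channel])

def parse(buf: bytes):
    """Return (kind, channel, header_length)."""
    first = buf[0]
    if parity(first) != 0:
        raise ValueError("header parity error")
    channel = first >> 1
    if channel != ESCAPE:
        return ("data", channel, 1)
    return (buf[1], buf[2], 3)
```

Data packets pay one byte of overhead and may use channels 1 through 127, while the less frequent packet types pay for the extra bytes.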

4.3.  Key  Functions  of  the  Communication  Component 

Any  communication  component  must  provide  mechanisms  for  routing  mes¬ 
sages  to  their  proper  destination,  managing  the  limited  amount  of  buffer  space, 
and  controlling  the  rate  at  which  packets  flow  from  one  node  to  another. 
Hardware implementation of these mechanisms is required to achieve high
performance. Mechanisms to perform these functions are outlined in this
section. Hardware implementations are described in the section that follows.
Implementation of other portions of the communication component, i.e. the I/O
ports and routing controller, is only briefly summarized since they are
described elsewhere [Laur79, Wong81, Fuji80].

A  block  diagram  indicating  the  functions  that  must  be  provided  by  each 
component  is  shown  in  figure  4.3.  The  component  contains  three  or  more  ports, 
each  accommodating  a  link  to  a  neighboring  node.  It  also  contains  a  certain 
amount  of  buffer  memory,  bookkeeping  tables,  and  control  logic.  Translation 
tables logically link incoming and outgoing channels, and thus establish the
various virtual circuits through the component. Finally, a microcoded engine called
the routing controller is responsible for setting up virtual circuits and
implementing less frequently used network functions such as failure recovery protocols.

4.3.1.  Routing 

All  communication  networks  require  some  routing  algorithm  to  build  the 
paths, i.e. the virtual circuits, between nodes sending and receiving messages. A
great  deal  of  research  has  been  done  in  the  area  of  routing  in  loosely  coupled 
computer networks, and much of this work is applicable here [Gerl81, Tane81].
In  the  context  of  the  proposed  communication  domain,  we  will  only  consider 
totally  distributed  routing  that  does  not  rely  on  a  centralized  authority.  For  this 
discussion  it  is  also  appropriate  to  distinguish  between  regular  networks  with  a 
predefined  topology,  such  as  arrays  or  binary  trees,  and  irregular  networks  of 
arbitrary  connectivity. 

Figure 4.3. Functions provided by communication component.

In regular networks, routing can be performed in each node by a state
machine which performs a fixed algorithm based on the local and destination
addresses. In square lattices, for example, the routing controller could forward
the message header in a direction that would reduce the difference between the
x- or y-coordinates of the current and the destination nodes. Routing
algorithms for binary half-ring and full-ring trees have been discussed elsewhere
[Sequ70].
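For the square lattice case, a fixed routing rule of this kind might look like the sketch below. Resolving the x difference before the y difference is an assumption made here for definiteness; the text requires only that each hop reduce one of the two differences.

```python
def next_hop(cur, dst):
    """Return a direction that reduces the x or y distance, or None on arrival.

    cur and dst are (x, y) lattice coordinates.
    """
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return "+x" if dx > cx else "-x"
    if cy != dy:
        return "+y" if dy > cy else "-y"
    return None  # the header has reached the destination node
```

Because the decision depends only on the local and destination addresses, it can be realized as a small state machine with no routing table at all.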

For  a  general-purpose  communication  component,  the  routing  algorithm 
must  not  be  frozen  in  hardware.  A  routing  controller  with  a  writable  program 
memory  is  more  appropriate  and  guarantees  that  the  same  component  can 
serve  many  different  network  topologies.  A  routing  algorithm  suitable  for  the 
particular  network  structure  could  be  broadcast  at  system  initialization. 

For irregular networks, routing may be based on suitable lookup tables. In
a decentralized system each node i has entries of the form

    $NN = R_i(DN)$,

implying that messages destined for node DN are forwarded by node i to neighbor
node NN. This lookup table, commonly called a routing table, can be
defined  statically,  or  it  can  be  maintained  dynamically  using  information 
exchanged  between  neighboring  nodes.  The  latter  approach  also  allows  the  net¬ 
work  to  automatically  reconfigure  itself  should  the  topology  change  due  to  node 
failure  or  network  expansion  [Taji77].  Techniques  to  initialize  and  maintain  the 
routing tables are discussed in [Gerl81].

If  the  network  has  many  nodes,  the  routing  table  will  be  excessively  large 
since  a  separate  entry  is  required  for  each  destination  node.  A  common  tech¬ 
nique  which  reduces  the  size  of  this  table  is  to  employ  hierarchical  names  and 
multiple  routing  tables  per  node  [Kamo78].  An  example  of  such  a  mechanism  is 
seen  in  the  telephone  system  in  which  names  (telephone  numbers)  consist  of  an 
area  code  and  a  seven  digit  number.  When  a  call  to  a  number  with  a  different 
area code is made, the area code is first used to route the call to the correct
area,  and  then  the  phone  number  is  used  to  locate  the  final  destination.  Con¬ 
ceptually,  routing  could  thus  be  performed  as  follows: 

(1)  If  the  area  code  of  the  destination  matches  that  of  the  router,  then  the 
seven  digit  number  is  used  to  locate  the  next  node  via  a  "neighborhood" 
routing  table. 

(2)  If  the  area  code  does  not  match  that  of  the  node  doing  the  routing,  then  the 
area code is used to look up the next node via an "area code" routing table.
The  remaining  seven  digits  in  the  phone  number  are  ignored. 
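The two cases above can be sketched as follows. The table contents, node names, and area codes are hypothetical and serve only to make the example runnable.

```python
def next_node(local_area, dest_area, dest_number, neighborhood_table, area_table):
    """Two-level lookup: same area uses the number, otherwise only the area code."""
    if dest_area == local_area:
        return neighborhood_table[dest_number]
    # The seven-digit number is ignored when the area codes differ.
    return area_table[dest_area]

# Hypothetical tables held by a router located in area "415".
neighborhood_table = {"5551234": "node_a", "5559876": "node_b"}
area_table = {"212": "node_east", "213": "node_south"}
```

Every destination in a remote area shares one entry, which is where the saving in table size comes from.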

Thus, a two-level naming hierarchy is used along with a routing table for each
level. Such a scheme reduces the table size by grouping nodes which are far
away into a single entry in the "area code" routing table.

One can easily extend this principle to an arbitrary number of naming levels.
To determine the number of levels required to minimize the storage space
required for routing tables, let there be $l$ levels, with $g_i$ entries in the level $i$
table. The object is to minimize $g_1 + g_2 + \cdots + g_l$ subject to constant
$N = g_1 \times g_2 \times \cdots \times g_l$, the number of nodes in the network. It is easy to
show that this sum is minimized for

    $g_1 = g_2 = \cdots = g_l = e$ and $l = \ln N$,

where e is approximately 2.718. Thus, to minimize the table size in each node,
there should be many levels with few entries in each level [McQu74].
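One way to arrive at this result is sketched below, under the simplifying assumption that the table sizes and the number of levels may be treated as real numbers rather than integers.

```latex
% For a fixed number of levels l, the AM-GM inequality gives
\[
  g_1 + g_2 + \cdots + g_l \;\ge\; l\,(g_1 g_2 \cdots g_l)^{1/l} \;=\; l\,N^{1/l},
\]
% with equality exactly when g_1 = g_2 = \cdots = g_l = N^{1/l}.
% It remains to minimize S(l) = l N^{1/l} over l:
\[
  \frac{d}{dl}\,\ln S(l) \;=\; \frac{1}{l} - \frac{\ln N}{l^{2}} \;=\; 0
  \quad\Longrightarrow\quad
  l = \ln N, \qquad g_i = N^{1/\ln N} = e ,
\]
% and the derivative changes sign from negative to positive at l = ln N,
% so this stationary point is a minimum, with
\[
  S_{\min} \;=\; e \ln N .
\]
```

For a 16-bit address space, $N = 2^{16}$ gives $e \ln N \approx 30$ entries, so tables with a small integer number of entries per level come close to the continuous optimum.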

The reduction in table size resulting from a multi-level routing scheme can
be substantial. A 16-bit destination address partitioned into eight 2-bit fields
requires eight 4-entry routing tables, or a total of 32 entries. The single-level
routing table would require 65,536 entries. The routing controller described in
[Fuji80] uses a single-level lookup table with 256 entries. A hardware
implementation of a hierarchical routing scheme will be presented later.

4.3.2.  Buffer  Management 

Each  message  passed  into  the  communication  domain  must  be  subdivided 
by  the  sender  into  some  number  of  fixed-length  packets.  As  discussed  earlier, 
allowing  variable  length  packets  adds  a  considerable  amount  of  complexity  to 
the  component.  These  packets  form  the  unit  of  data  transmitted  across  the 
links  of  the  communication  domain.  Due  to  conflicts  that  arise  when  several 
packets simultaneously require the use of the same link, buffering is required in
each node. The communication component must have some strategy for managing
these buffers.

A scheme is necessary to allocate a node's buffers among the virtual circuits
using the node. A simple solution is to give each channel on each link a
separate buffer. This is inefficient, however, since much of the buffer space will
be  unused  most  of  the  time.  By  allowing  several  channels  to  share  buffers, 
fluctuations  in  the  need  for  buffer  space  can  be  averaged  over  a  large  number  of 
communication  paths,  and  fewer  buffers  are  required  to  achieve  the  same  per¬ 
formance.  A  mapping  is  then  required  to  link  each  channel  to  the  buffers  hold¬ 
ing  packets  for  that  channel  so  that  they  can  be  found  when  it  is  time  to  forward 
them.  Furthermore,  when  a  new  packet  arrives,  an  empty  buffer  must  be  found. 
From  this  perspective,  buffer  management  is  similar  to  the  management  of  a 
cache  memory:  a  program  (here,  a  channel)  must  fit  blocks  from  main  memory 
(packets)  into  cache  pages  (buffers). 

As  in  cache  memory  design,  there  are  three  well  known  schemes  for  per¬ 
forming  this  mapping: 

(1)  direct  mapping 

(2)  set-associative  mapping 

(3)  fully  associative  mapping 

In  turn,  these  three  schemes  offer  an  increased  degree  of  buffer  sharing,  and 
thus  improved  memory  utilization,  but  at  the  cost  of  increased  complexity  in 
the  control  circuitry.  They  are  distinguished  by  restrictions  on  where  a 
channel's packets can be placed. In the direct mapping scheme (minimal sharing),
each channel has a set of buffers dedicated to it, i.e. its own FIFO buffer
queue.  The  set-associative  scheme  (moderate  sharing)  allows  each  channel  to 
use  a  larger  set  of  buffers,  but  it  is  no  longer  given  sole  access  to  them.  This 
scheme  might  be  implemented  by  letting  all  channels  of  a  single  port  share  a 
pool  of  buffers  dedicated  to  this  port.  In  the  fully  associative  scheme  (maximal 
sharing),  each  node  has  a  centralized  pool  of  buffers  which  all  channels  share. 
Implementations of the set-associative and fully associative schemes will be
discussed in later sections. An implementation of the direct mapping scheme has
been described previously [Laur79, Sequ78].

4.3.3. Flow Control

Flow  control  refers  to  the  mechanism  which  regulates  the  transmission  of 
data packets along virtual circuits. The network must be able to "throttle"
traffic on virtual circuits to prevent buffer overflow (such mechanisms are
sometimes referred to as congestion control in the literature [Tane81]), and to
handle situations in which a processor is sent more messages than it can
immediately receive. In addition to providing a mechanism which allows components to
throttle traffic, a policy is also required to determine which virtual circuits must
be  throttled,  and  when.  Such  a  policy  will  be  discussed  next,  followed  by  a  dis¬ 
cussion  of  different  throttling  mechanisms. 

Since  one  of  the  purposes  of  flow  control  is  to  avoid  buffer  overflow,  a 
natural policy is to begin throttling traffic when the pool of free (i.e. empty)
buffers  becomes  depleted.  If  a  node  is  inundated  with  data,  packets  will  "back 
up"  along  the  virtual  circuits  leading  up  to  it,  much  like  the  way  cars  back  up  on 
a  congested  freeway.  This  type  of  flow  control,  called  "back  pressure  flow  con¬ 
trol",  is  analogous  to  water  (packets)  flowing  through  a  pipe  (buffers).  If  the 
pipe  becomes  blocked  or  constricted,  water  backs  up  to  its  source.  Such  a 
mechanism  has  been  used  successfully  in  TYMNET,  a  loosely  coupled,  commer¬ 
cial communication network [Tyme81].

The  flow  control  policy  described  above  can  lead  to  a  problem  called  "buffer 
hogging".  Here,  one  virtual  circuit  uses  more  than  its  share  of  the  buffers  in  a 
node.  If  a  virtual  circuit  becomes  blocked,  e.g.  due  to  a  congested  output  link, 
packets  may  continue  to  arrive  on  that  virtual  circuit  and  occupy  most,  or  all  of 
the  buffers  in  the  node.  Without  some  mechanism  to  restrict  buffer  sharing, 
buffer  hogging  will  impede  other  traffic  using  the  node  and  lead  to  deadlock 
situations. This situation can be avoided by controlling the maximum number of
buffers each channel can use. It might be noted that the direct mapping
scheme,  and  to  a  lesser  extent  the  set-associative  scheme,  automatically  pro¬ 
vide  some  protection  against  buffer  hogging,  since  they  inherently  restrict 
buffer  sharing.  All  three  schemes  however,  need  some  mechanism  to  ensure 
that  data  is  not  lost  if  no  free  buffers  are  available. 

Thus,  in  order  to  prevent  buffer  hogging,  each  output  channel  may  not  hold 
more  than  some  "channel  limit"  of  buffers  at  once.  Even  with  this  restriction 
however,  another  form  of  buffer  hogging  may  still  arise.  A  congested  output  link 
could  use  all  of  the  node's  buffers  and  block  traffic  on  other  links.  To  prevent 
this,  each  output  port  is  restricted  to  using  no  more  than  some  maximum 
number  of  buffers,  determined  by  a  higher  level  protocol.  This  maximum 
number,  called  the  "port  limit",  can  be  changed  dynamically  to  shift  additional 
buffers  to  highly  utilized  ports,  while  still  providing  some  space  for  traffic  on 
lightly loaded ports. Studies indicate that by restricting the number of buffers
an output port can use, "output port buffer hogging" is prevented, and a
significant improvement in the bandwidth provided by the node is obtained
[Irla78]. These studies also indicate that as a general rule, each port should not
be allowed to use more than $b/\sqrt{p}$ buffers in a p-port node with b buffers.
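The combined channel limit and port limit policy can be sketched as an admission check made when a packet needs a buffer. The class and method names are hypothetical, and both limits are passed in as parameters, since the text leaves the port limit to a higher level protocol.

```python
class BufferPool:
    """Tracks buffer use per output channel and per output port (illustrative only)."""

    def __init__(self, total, channel_limit, port_limit):
        self.free = total
        self.channel_limit = channel_limit
        self.port_limit = port_limit
        self.by_channel = {}   # (port, channel) -> buffers currently held
        self.by_port = {}      # port -> buffers currently held

    def try_accept(self, port, channel):
        """Claim a buffer for a packet bound for (port, channel); False if throttled."""
        key = (port, channel)
        if self.free == 0:
            return False                                   # pool depleted
        if self.by_channel.get(key, 0) >= self.channel_limit:
            return False                                   # channel would hog buffers
        if self.by_port.get(port, 0) >= self.port_limit:
            return False                                   # output port would hog buffers
        self.free -= 1
        self.by_channel[key] = self.by_channel.get(key, 0) + 1
        self.by_port[port] = self.by_port.get(port, 0) + 1
        return True

    def release(self, port, channel):
        """Return a buffer to the pool once its packet has been forwarded."""
        self.free += 1
        self.by_channel[(port, channel)] -= 1
        self.by_port[port] -= 1
```

A rejected packet is not lost by this check alone; what happens next depends on the throttling mechanism, which is discussed below.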

Assuming  a  buffer  allocation  policy  is  used  to  control  the  rate  of  packet  for¬ 
warding,  let  us  now  examine  the  flow  control  mechanism  itself,  i.e.  the  mechan¬ 
ism  which  performs  the  actual  throttling.  Two  mechanisms,  sender-controlled 
and  receiver-controlled  throttling,  will  be  discussed.  They  are  characterized  by 
whether  the  sending  or  the  receiving  node  implements  the  policy  described 
above.  The  receiver-controlled  mechanism  is  the  simpler  mechanism,  and  will 
be  described  first. 

The  receiver-controlled  flow  control  mechanism  can  be  implemented  by  a 
send/acknowledge protocol to transmit data over the link. In this scheme, each
node sends a packet, and waits for the receiver to return a control signal
indicating whether it accepted or rejected (i.e. discarded) the packet. An "ack"
signal denotes an accepted packet while a "nack" denotes a rejected packet. If a
nack  is  returned,  the  packet  must  be  retransmitted  at  a  later  time. 

A receiver may choose to reject a packet because of buffer space limitations
or transmission errors. Here, it is assumed that communication components
only check header information for transmission errors, since the virtual
cut-through  mechanism  prevents  retransmission  if  errors  in  the  data  are 
detected.  With  virtual  cut-through,  the  first  bytes  of  the  packet  may  have  been 
forwarded  to  the  next  node  before  an  error  in  later  bytes  is  detected,  making 
immediate  recovery  difficult,  if  not  impossible.  Errors  in  data  bytes  must  be 
handled  by  an  end-to-end  protocol  which  detects  and  retransmits  damaged 
packets. 

It  is  also  assumed  here  that  each  link  has  a  separate  control  line  to  carry 
the  ack/nack  signal  back  to  the  sender.  Alternatively,  the  control  signal  could 
be piggy-backed onto a packet going in the opposite direction; however, this
leads  to  a  "looser  coupling"  between  sender  and  receiver,  forcing  the  sender  to 
either  deal  with  multiple  unacknowledged  packets  pending  over  the  link 
[Pouz78]  (adding  a  considerable  amount  of  complexity  to  the  circuitry  in  the 
port),  or  to  stop  using  the  link  until  the  acknowledgement  arrives  (wasting 
bandwidth).  Since  the  receiver  can  generate  an  acknowledgement  after  only  the 
header  is  received,  and  since  a  direct  connection  to  the  sender  is  available  for 
transmitting this signal, this scheme offers the unusual feature that the sender
will  receive  the  acknowledgement  before  it  has  finished  sending  the  packet! 
This  allows  a  virtual  circuit  to  "pipeline"  a  stream  of  packets  through  an  other¬ 
wise  idle  node  without  incurring  the  delays  associated  with  waiting  for  ack¬ 
nowledgements  or  the  complexity  of  multiple  unacknowledged  packets. 


An  alternative  approach  to  flow  control  is  to  implement  the  buffer  allocation 
policy  for  a  node  in  its  neighboring  nodes,  i.e.  control  the  flow  of  information 
from  the  sender  rather  than  the  receiver  end  of  each  link.  For  example,  each 
output  port  could  maintain  a  table  remembering  how  many  buffers  in  the  neigh¬ 
boring  node  are  allocated  to  each  channel  of  the  link  connecting  the  two.  With 
this  information,  the  sender  can  decide  which  channel  to  serve  next,  and  pack¬ 
ets  can  be  forwarded  without  the  risk  of  overflowing  the  buffer  space  in  the 
receiver.  Maintaining  this  remote  status  information  requires  some  overhead: 
The  fact  that  the  receiver  has  freed  up  a  buffer  must  be  reported  back  to  the 
transmitter.  Finally,  since  packets  cannot  be  retransmitted,  transmission 
errors  in  packet  headers  result  in  lost  packets.  An  end-to-end  mechanism  is 
required  to  retransmit  these  packets. 
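The sender-side bookkeeping can be sketched as a per-channel count of buffers held in the neighboring node. The class and method names are hypothetical, and the scheduling rule (first eligible channel in the order given) is an assumption.

```python
class SenderSidePort:
    """Output port that tracks buffers it occupies in the neighboring node."""

    def __init__(self, num_channels, channel_limit):
        self.held = [0] * num_channels   # remote buffers in use, per channel
        self.channel_limit = channel_limit

    def may_send(self, channel):
        return self.held[channel] < self.channel_limit

    def on_send(self, channel):
        """Record that a packet was sent and now occupies a remote buffer."""
        if not self.may_send(channel):
            raise RuntimeError("sending would overflow the receiver's buffers")
        self.held[channel] += 1

    def on_buffer_freed(self, channel):
        """Called when the receiver reports that it has freed a buffer."""
        self.held[channel] -= 1

    def pick_channel(self, pending):
        """Return the first channel with data pending that is allowed to send."""
        for channel in pending:
            if self.may_send(channel):
                return channel
        return None
```

Since the sender never transmits without a known free buffer, no packet is rejected, but every freed buffer must be reported back over the link.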

As in the send/acknowledge flow control scheme, buffer hogging is
prevented  by  controlling  the  number  of  buffers  used  by  each  channel.  It  might 
be noted, however, that output port buffer hogging is much more difficult to
prevent. This is because the size of the queue on an output link depends on the
packets received from the node's neighbors. When these neighbors send packets,
they do not know which output port in the receiving node the packet will
use, since routing decisions are made inside the receiver. Thus the neighbors
cannot  control  the  queue  size  on  a  specific  link,  and  nothing  prevents  a  single 
port  from  monopolizing  the  entire  buffer  pool. 

The send/acknowledge protocol leads to a simple implementation, while the
remote  buffer  management  approach  prevents  rejected  packets,  and  thus 
avoids  retransmissions  and  waste  of  bandwidth.  Both  schemes  require  some 
overhead  to  provide  the  feedback  signals  necessary  for  flow  control.  In  the 
send/acknowledge  scheme,  dedicated  pins  are  used,  while  in  the  remote  buffer 
management scheme, piggy-backed control signals are required. Implementation
and comparisons of these two mechanisms will be described in the sections
which follow.

4.4.  Implementation  of  VLSI  Communication  Components 

Hardware  implementations  of  the  communication  functions  described 
above  are  outlined  in  this  section,  and  two  designs  are  presented  which 
integrate  these  functions  into  a  single  chip.  The  first  is  a  Y-component  design 
using  a  set-associative  buffer  management  scheme  and  remote  buffer  allocation 
for  flow  control.  It  will  be  seen  that  there  are  a  number  of  severe  deficiencies  in 
this  design.  The  second  design,  which  corrects  these  deficiencies,  uses  a  fully 
associative  buffer  management  scheme.  Implementations  of  both  sender- 
controlled  and  receiver-controlled  flow  control  mechanisms  are  also  discussed. 
Common  to  both  designs  is  the  routing  controller  with  hardware  support  for 
hierarchical  routing.  This  is  the  subject  of  the  next  section.  The  two  designs  are 
described  in  subsequent  sections. 

4.4.1.  Routing  Hardware 

This  section  describes  a  hardware  implementation  of  the  hierarchical  rout¬ 
ing  table  mechanism  described  earlier.  This  hardware  is  part  of  the  routing  con¬ 
troller  which  is  responsible  for  setting  up  virtual  circuits  through  the  node.  The 
remainder of the routing controller is described in [Fuji80].

When  a  virtual  circuit  is  being  constructed,  the  routing  hardware  is  given  a 
hierarchical  destination  address,  and  must  determine  which  output  port  the  vir¬ 
tual  circuit  is  to  use.  This  is  accomplished  by  a  set  of  routing  tables,  one  for 
each  level  of  the  hierarchy,  as  discussed  earlier.  The  routing  controller,  part  of 
which  is  implemented  as  a  microprogrammed  engine,  is  responsible  for  loading 
and maintaining the routing tables, e.g. by a shortest path routing algorithm.

Two implementations are discussed. The first assumes that routing tables
at all levels are the same size, some power of 2. The second design relaxes this
assumption, but at the cost of added complexity.

An $l$-level hierarchical node address consists of a string of digits,
$A_{l-1} A_{l-2} \cdots A_1 A_0$. Digit $A_i$ is used to index the routing table at
level $i$. A routing table entry contains either an output port number indicating
which port to use, or a "NULL" flag indicating that the table on the next level
must be searched. Let $RT_i$ denote the routing table at level $i$, with
$0 \le i < l$. The algorithm to determine the appropriate output port is as follows:

    level := 0;    /* current level: 0, 1, ..., l-1 */

    /* test the bound first so that RT is never indexed at level l */
    while ((level < l) and (RT_level[A_level] = NULL))
        level := level + 1;

    if (level < l)
        return (RT_level[A_level]);    /* return output port */
    else
        return (NULL);                 /* destination node reached */

If this routine returns NULL, then the message has reached its final destination,
i.e. the destination address matches the local address. Otherwise, the number
of  the  output  port  selected  by  the  routing  algorithm  is  returned. 
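The same search can be written as a short runnable sketch. Representing NULL by `None`, the tables by lists, and the address by a list of digits with `digits[i]` holding $A_i$ are choices made for this example only.

```python
def lookup(tables, digits):
    """Search the routing tables level by level.

    tables[i] is the routing table at level i; digits[i] is the address digit
    that indexes it. Returns an output port number, or None if every level
    yields NULL, meaning the destination address matches the local address.
    """
    for level, table in enumerate(tables):
        entry = table[digits[level]]
        if entry is not None:      # port 0 is a valid port, so test for None explicitly
            return entry
    return None
```

For example, with two 4-entry tables in a node whose own address digits are 1 (level 0) and 2 (level 1), the NULL entries sit at `tables[0][1]` and `tables[1][2]`.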

One hardware implementation of this table lookup mechanism is shown in
figure 4.4a. It is assumed that each table contains $2^k$ entries. The bus widths in
figure 4.4a assume that there are 8 levels, and 4 routing table entries in each
level (i.e. $k = 2$). The "address register" holds the destination address. The
rightmost k bits of this register hold $A_0$, the next k bits hold $A_1$, etc. A single
RAM holds all of the routing tables. The upper bits of the address lines of this
RAM specify a routing table (i.e. a level), and the lower bits specify an offset into
this table. The lower k bits of the address register (i.e. $A_{level}$ in the program
above) are concatenated with the output of the level counter (the current level)
to form this address. The shifter aligns the destination address bits by shifting
out $A_i$ and moving $A_{i+1}$ into the rightmost position each clock cycle. Since each
bit is shifted exactly k bits on each clock, the shifter can be implemented by an
edge-triggered register and a simple permutation of wires. Finally, not shown is
the control logic which sequences through the various routing tables. Design of
this finite state machine is straightforward, using the level counter and circuitry
to detect NULL routing table entries, and generating signals to shift the
address bits and increment the level counter.
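The address formation in this first implementation can be mimicked in software as below. The constants match the example in the text (8 levels, k = 2); modelling the RAM as a flat list and NULL as `None` are assumptions of the sketch.

```python
K = 2        # bits per address digit, so each table has 2**K entries
LEVELS = 8   # number of routing tables held in the single RAM

def lookup_fixed(ram, address):
    """Emulate the level counter, address register, and shifter.

    ram is a flat list of LEVELS * 2**K entries; entry None stands for NULL.
    """
    digit_mask = (1 << K) - 1
    for level in range(LEVELS):                # level counter
        digit = address & digit_mask           # low k bits of the address register
        entry = ram[(level << K) | digit]      # level concatenated with the digit
        if entry is not None:
            return entry                       # output port found
        address >>= K                          # shifter: bring the next digit down
    return None                                # all levels NULL: local destination
```

Because the shift amount is fixed at k bits, the hardware shifter needs no logic beyond a register and rewired connections, as the text notes.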

A second implementation, shown in figure 4.4b, relaxes the "fixed routing
table  size"  restriction.  The  bus  widths  shown  in  this  figure  support  up  to  8  levels 
and  a  total  of  256  routing  table  entries.  The  number  of  levels  and  sizes  of  the 
various  routing  tables  is  programmable  at  system  initialization.  The  "address 
RAM"  holds  the  base  addresses  of  the  various  routing  tables.  Entry  i  contains 
the  base  address  of  the  routing  table  at  level  i.  The  routing  tables  are  again 
stored in a single RAM. The routing table offset, $A_i$, is generated by masking
appropriate  bits  of  the  address  register.  This  offset  is  added  to  the  base  address 
to  generate  an  address  for  the  routing  table  RAM.  A  barrel  shifter  aligns  data  in 
the  address  register  for  the  next  iteration.  The  mask  bits  and  the  number  of 
address bits to be shifted are stored in the "mask RAM" and "shift RAM"
respectively. These RAMs are loaded by the routing controller at initialization.
Together,  their  contents  describe  the  format  of  the  address  register.  The  con¬ 
trol  logic  for  this  second  implementation  is  virtually  the  same  as  that  of  the  pre¬ 
vious  design. 
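A software analogue of this second implementation is sketched below. The lists `base`, `mask`, and `shift` stand in for the address RAM, mask RAM, and shift RAM; their contents in the usage note are made up for illustration.

```python
def lookup_variable(rt_ram, base, mask, shift, address):
    """Table lookup with per-level table sizes.

    base[i]  : base address of the level-i table in rt_ram (address RAM)
    mask[i]  : bits of the address register that form the level-i offset (mask RAM)
    shift[i] : number of bits to shift after level i (shift RAM)
    """
    for level in range(len(base)):
        offset = address & mask[level]             # mask off the current digit
        entry = rt_ram[base[level] + offset]       # base address plus offset
        if entry is not None:
            return entry
        address >>= shift[level]                   # barrel shifter aligns the next digit
    return None
```

For instance, a 4-entry level 0 table followed by an 8-entry level 1 table would use `base = [0, 4]`, `mask = [0b11, 0b111]`, and `shift = [2, 3]`.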

4.4.2. A Y-Component Design

The design of a Y-component has been studied [Wong81]. Details of this
design will be repeated here as an example of one implementation of the
functions described above. A block diagram for this design is shown in figure
4.5. The component consists of a routing controller (R), three input ports, three
output ports, and three buffer modules (B), one associated with each input port.
It will be assumed that there are c input and c output channels on each port,
and that each buffer module consists of b data buffers.

When  a  packet  arrives  at  an  input  port,  it  is  placed  in  one  of  that  port’s 
buffers.  The  routing  controller  (which  can,  for  the  moment,  be  considered  a 
fourth  output  port)  and  the  other  two  output  ports  actively  search  this  input 
port’s  translation  table  and  buffers  to  locate  packets  destined  for  it.  When  such 
a packet is found, it is forwarded and the buffer is marked empty. Some additional
control logic ensures that packets on each channel are forwarded in the
order in which they arrived.

Figure 4.5. Block diagram of Y-component.

Although  the  translation  table  and  the  buffer  memory  of  a  single  input  port 
can both be read at the same time, it is not possible to simultaneously perform
two  reads  of  the  same  translation  table  or  buffer  memory.  To  avoid  conflicts, 
each  "major  clock  cycle"  (the  time  interval  to  transmit  or  receive  a  single  word 
of  data  over  the  link)  is  subdivided  into  4  "minor  clock  cycles",  and  these  minor 
cycles  are  statically  assigned  to  output  ports  to  time  multiplex  access  to  the 
buffers and translation tables without contention. It is assumed that each translation
table and buffer memory can be accessed during a single minor clock
cycle.  This  assignment  ensures  that  each  output  port  has  an  opportunity  to 
read  the  translation  table  and  buffer  memory  of  each  of  the  other  two  ports  (or 
in  the  case  of  the  routing  controller,  the  other  three  ports)  during  each  major 
clock  cycle. 
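The static assignment of minor cycles can be illustrated with a short sketch. The text does not give the exact assignment, so the one below is an assumption: each of the four readers (three output ports plus the routing controller) owns one minor cycle, during which it reads the memories of every input port other than its own. The names `READERS` and `minor_cycle_schedule` are hypothetical.

```python
# Hypothetical static schedule: one minor cycle per reader (assumption,
# the report does not specify the actual assignment).
INPUT_PORTS = (0, 1, 2)
READERS = ("out0", "out1", "out2", "router")  # router = routing controller


def minor_cycle_schedule():
    """Return a list indexed by minor cycle: (reader, input ports it reads)."""
    schedule = []
    for minor, reader in enumerate(READERS):
        # Output port k skips its own input port; the router reads all three.
        own_port = minor if reader.startswith("out") else None
        targets = tuple(p for p in INPUT_PORTS if p != own_port)
        schedule.append((reader, targets))
    return schedule


for minor, (reader, targets) in enumerate(minor_cycle_schedule()):
    print(f"minor cycle {minor}: {reader} reads input ports {targets}")
```

Because only one reader is active in any minor cycle, no translation table or buffer memory is read twice in the same minor cycle, which is the property the text requires.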

4.4.2.1. Buffer Management Hardware

A  set-associative  buffer  management  scheme  is  used  in  this  design.  Of  the 
b  buffers  assigned  to  each  input  port,  each  channel  is  statically  assigned  (say)  4 
buffers  by  means  of  some  algorithm  for  mapping  channel  numbers  to  buffer 
addresses, e.g. channel i might be able to access buffers i, i+1, i+2, and i+3,
where all sums are taken modulo b. Thus, several channels share the use of
each  buffer. 
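The example mapping above can be written out directly. This is a minimal sketch of that one example mapping only (the text says "some algorithm"); the function names and the default of four buffers per channel are taken from the example, not from the actual design.

```python
def candidate_buffers(channel: int, b: int, ways: int = 4) -> list[int]:
    """Buffers channel may use: channel, channel+1, ... taken modulo b."""
    return [(channel + k) % b for k in range(ways)]


def sharers(buffer: int, c: int, b: int, ways: int = 4) -> list[int]:
    """Channels (out of c) whose candidate set includes the given buffer."""
    return [ch for ch in range(c) if buffer in candidate_buffers(ch, b, ways)]


print(candidate_buffers(14, b=16))    # wraps around the end of the buffer pool
print(len(sharers(0, c=128, b=16)))   # number of channels competing for buffer 0
```

With many more channels than buffers, each buffer is shared by a large number of channels, which is the situation the later sections address.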

Each input channel has four status bits which indicate which buffers actually
hold a packet for that channel. These bits, as well as the input port’s translation
tables, are scanned by the other two output ports and the routing controller to
locate packets which must be forwarded.

4.4.2.2. Flow Control Hardware


A  remote  buffer  management  scheme  is  used  for  flow  control.  The  buffers 
of  each  input  port  are  managed  by  the  neighbor  on  the  sender  side  of  the  link. 
In  other  words,  the  output  port  of  each  node  is  responsible  for  allocating  buffer 
space  in  the  neighboring  node  to  the  packets  it  sends. 

In  addition  to  the  input  port  status  bits  described  earlier,  each  output  port 
maintains  a  bit  map  indicating  which  of  the  buffers  of  the  input  port  on  the 
other  side  of  the  link  are  free,  and  which  are  in  use.  When  a  component  sends  a 
packet,  it  not  only  specifies  the  number  of  the  channel  the  packet  is  being  sent 
on,  but  also  the  buffer  that  the  receiving  component  is  to  use.  It  also  must  set 
the  appropriate  bit  of  the  bit  map  to  signify  that  the  remote  buffer  is  now  in  use. 
The  input  port  receiving  the  packet  then  loads  it  into  the  designated  buffer,  and 
sets  the  appropriate  input  port  status  bit  for  the  channel  the  packet  arrived  on, 
indicating  that  it  has  a  packet  waiting  to  be  forwarded.  When  the  appropriate 
output  port  sees  that  this  bit  has  been  set,  it  forwards  the  packet.  The  neighbor 
which  originally  sent  it  must  be  notified  that  this  buffer  is  now  free.  A  control 
byte  piggy-backed  onto  a  packet  going  to  this  neighbor  accomplishes  this  task 
(a  dummy  packet  is  created  if  there  is  no  traffic  in  this  direction).  Since  the 
sender does not send a packet unless there is an empty buffer on the neighboring
node to receive it, buffer overflow cannot occur.

4.4.3.  Deficiencies  in  the  Y-Component  Design 

The  design  presented  above  suffers  from  a  number  of  deficiencies.  The 
most  severe  problem  arises  from  the  polling  scheme  used  to  determine  which 
channels hold packets waiting to be forwarded: each output port polls the translation
tables and the status bits of the other input ports. If there are c channels
per port, then each port requires c major clock cycles to poll all of the input
channels of the other two ports (two channels, one from each port, can be polled
in one clock cycle). If a packet arrives on an arbitrary channel, then an average
of  c/2  clock  cycles  expire  before  that  channel  is  polled.  Later,  it  will  be  seen 
that  c  should  be  relatively  large,  say  128  or  256,  so  long  delays  result  from  this 
polling  scheme.  In  addition,  it  will  be  seen  that  the  number  of  buffers  in  each 
component  need  not  be  very  large,  say  16  or  32,  so  most  channels  do  not  have 
packets  waiting  to  be  forwarded.  Many  idle  channels  will  have  to  be  polled 
before  a  channel  with  data  is  found.  Thus,  channel  polling  is  an  unreasonably 
slow  and  inefficient  mechanism  to  locate  waiting  packets. 
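The arithmetic behind this argument can be checked with a few lines. The sketch below uses the c/2 estimate from the text and bounds the fraction of channels that can hold a packet by assuming each waiting packet occupies one buffer; which buffer and channel counts to compare is left to the caller, and the function names are hypothetical.

```python
def mean_polling_delay(c: int) -> float:
    """Average major clock cycles before an arbitrary channel is polled (c/2)."""
    return c / 2


def max_fraction_with_packets(buffers: int, channels: int) -> float:
    """Upper bound on the fraction of channels with a waiting packet,
    assuming every waiting packet occupies exactly one buffer."""
    return min(buffers, channels) / channels


for c in (128, 256):
    print(c, mean_polling_delay(c), max_fraction_with_packets(32, c))
```

Even with 32 buffers and 128 channels, at least three quarters of the polled channels are idle under this bound.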

The  remote  buffer  management  scheme  described  above  wastes  link 
bandwidth,  since  it  requires  more  overhead  than  is  actually  necessary.  In  the 
previously  described  scheme,  a  buffer  number  precedes  every  packet  sent  over 
the  link.  This  is  required  because  the  sender  allocates  buffers  in  the  receiving 
node. The allocation function could be controlled by the receiver, however, since
the sender only needs to be sure that a remote buffer exists to hold each packet
it sends, and does not need to know the address of the remote buffer. Thus,
buffer numbers need not be transmitted over the link. A single counter indicating
the number of free buffers in the remote node’s input port could be used to
provide the necessary information without incurring additional overhead on the
link. Sending a packet decrements the counter, while receiving a signal indicating
that the remote node forwarded the packet increments it. The receiver is
left  the  responsibility  of  determining  which  buffer  each  arriving  packet  should 
use.  This  approach  eliminates  the  need  to  send  buffer  numbers  over  the  link, 
and  thus  achieves  more  efficient  use  of  the  link’s  bandwidth.  Details  of  such  an 
approach  will  be  described  later. 
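The counter-based alternative can be sketched as follows. This is an illustration of the idea only, not the detailed design described later; the class name `RemoteFreeCounter` and its methods are hypothetical.

```python
class RemoteFreeCounter:
    """Sender-side count of free buffers in the neighbor's input port."""

    def __init__(self, remote_buffers: int) -> None:
        self.capacity = remote_buffers
        self.free = remote_buffers      # all remote buffers start out empty

    def try_send(self) -> bool:
        """Send only if a remote buffer is known to be free."""
        if self.free == 0:
            return False                # hold the packet; overflow cannot occur
        self.free -= 1
        return True

    def on_release(self) -> None:
        """Neighbor signalled that it forwarded one of our packets."""
        if self.free < self.capacity:
            self.free += 1


counter = RemoteFreeCounter(remote_buffers=2)
print(counter.try_send(), counter.try_send(), counter.try_send())
counter.on_release()
print(counter.try_send())
```

No buffer number appears anywhere in the exchange: the receiver is free to place each packet in whichever buffer it chooses.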

The  design  described  above  requires  several  memory  references  to  the 
same  memory  on  each  major  clock  cycle.  For  example,  the  translation  table 
polling  mechanism  requires  four  memory  references  per  clock.  In  addition,  if 
two output ports want to simultaneously forward packets from the same input
port, two buffer memory reads per clock are required. The time required by
these memory references could slow the clock rate, reducing the communication
bandwidth of the entire network.

Finally, the studies which follow indicate that a high degree of buffer sharing
is desirable, since there are many more channels than buffers. This
increases the desirability of a fully associative buffer management scheme. One
implementation of such a scheme is described next.

4.4.4.  An  Alternative  Design 

In  order  to  remedy  the  deficiencies  described  above,  an  alternative  design 
for  a  communication  component  has  been  studied.  Unlike  the  previous  design, 
this design has been structured in such a manner that the number of I/O ports
can be increased without adding unduly to its complexity. A fully associative
buffer  management  scheme  is  explored,  as  well  as  two  types  of  flow  control 
mechanisms. 

A  block  diagram  of  the  communication  component  design  is  shown  in  figure 
4.6. The most distinguishing feature of this design is a single pool of buffers
shared  by  all  channels  of  the  component.  Since  all  packets  traveling  through  a 
node  must  use  this  pool,  it  must  provide  enough  bandwidth  to  avoid  becoming  a 
bottleneck. This is achieved by interleaving the memory 16 ways, assuming
packets consist of 16 bytes. Byte i of each packet (0 ≤ i < 16) is always stored in
memory module i (MMi). Each of the p ports can simultaneously load a packet
into a buffer, provided no two use the same memory module at the same time.
In the worst case, p packets simultaneously arrive at a node. Since only one
port can be granted access to MM0, additional registers are required to temporarily
buffer the arriving data bytes until they can be stored in MM0. On the
next clock cycle, when the second byte of each packet arrives, one of these
newly arriving bytes will be loaded into MM1, and one of the temporarily buffered
bytes can now be written into MM0. Similarly, three accesses to the buffer pool
will occur on the third clock, and so on. Eventually, each port will be able to
access  a  different  memory  module  on  each  clock  cycle. 
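The staggered access pattern in the worst case can be simulated in a few lines. The sketch assumes that when p packets arrive together, port k is granted MM0 on clock k and then moves to the next module on each following clock; the grant order and the function name `access_schedule` are assumptions made for illustration.

```python
PACKET_BYTES = 16  # one memory module per packet byte


def access_schedule(p: int) -> dict[int, dict[int, int]]:
    """Map clock -> {memory module: port writing it} for p simultaneous arrivals."""
    schedule: dict[int, dict[int, int]] = {}
    for port in range(p):
        start = port                        # assumed: port k wins MM0 on clock k
        for byte in range(PACKET_BYTES):
            slot = schedule.setdefault(start + byte, {})
            if byte in slot:                # two ports on one module: not allowed
                raise ValueError("memory module conflict")
            slot[byte] = port               # byte i always goes to module i
    return schedule


sched = access_schedule(4)
print([len(sched[clock]) for clock in range(5)])  # accesses on the first clocks
```

Under this assumption port k needs k holding registers for the bytes that arrive before it is granted MM0, and from clock p-1 onward every port accesses a different module on each clock.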

If the links can transmit one data byte per clock cycle, then the communication
component must be able to transport p bytes from the input ports to the
memory modules in each clock. A high-speed, time-multiplexed bus performs
this function. Since this bus remains entirely within the chip, it can run approximately
an order of magnitude faster than the I/O links, which require off-chip
communications [Sequ78]. A second high-speed bus carries bytes from the
memory modules to the output ports. Single-port memories can be used in the
memory modules provided the control logic only initiates one operation (forward
a packet or receive a packet) per clock cycle. The designs which follow
assume that this is the case.

Figure 4.6. Block diagram of alternative design.

A  block  diagram  of  the  control  logic  module  is  shown  in  figure  4.7.  Let  us 
consider  the  events  which  occur  when  a  packet  arrives  at  the  node.  First,  the 
header, i.e. the input channel number, arrives. The translation table is read to
determine  which  output  port  and  channel  will  be  forwarding  the  packet.  The 
output  of  the  translation  table  is  sent  to  both  the  buffer  management  and  flow 
control  modules.  The  buffer  management  module  allocates  an  empty  buffer  to 
hold  the  newly  arriving  packet,  and  notes  the  location  of  this  buffer  as  well  as 
the output port/channel specified by the translation table. This information will
be needed when it is time to forward the packet. The buffer module then sends
the address of the buffer into MM0, and the packet is stored, byte-by-byte, into
successive  memory  modules  on  subsequent  clock  cycles.  The  flow  control 
module  notes  that  this  output  channel  now  has  a  packet  waiting  to  be  forwarded. 
When  the  output  link  specified  by  the  translation  table  is  free,  the  flow  control 
logic  sends  a  signal  to  the  buffer  management  module  indicating  that  the  latter 
should  forward  the  next  packet  waiting  on  this  output  channel.  The  buffer 
management  module  finds  the  address  of  the  buffer  holding  this  packet  and 
sends  it  to  MM0.  The  packet  is  read  from  the  buffer  byte-by-byte,  and  forwarded 
over  the  output  link.  In  both  reading  and  writing  a  packet,  the  same  memory 
address  is  pipelined  from  memory  module  to  memory  module  on  successive 
clock  cycles.  Since  the  pipelined  structure  of  the  memory  modules  allows  the 
reading, i.e. forwarding, of a packet to begin before all of it arrives, virtual cut-through
is easily implemented.

The  sections  which  follow  give  more  detailed  explanations  of  possible 
hardware  implementations  of  these  mechanisms.  The  next  section  describes 
two  implementations  of  a  fully  associative  buffer  management  scheme  which 
differ in the number of buffers each virtual circuit can hold at one time. It is
seen that a significant reduction in complexity is possible if this number is restricted
to one. Following this, two possible implementations of flow control
mechanisms  are  presented.  The  first  uses  a  send/acknowledge  protocol,  while 
the  second  uses  a  remote  buffer  management  scheme. 

In the circuit diagrams which follow, the widths of data paths are based on a
component with 4 I/O ports, 32 data buffers, and 128 channels per link. Thus,
port numbers, buffer numbers, and channel numbers are 2, 5, and 7 bits in
length  respectively.  These  choices  will  be  discussed  later  in  this  chapter. 
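The field widths quoted above follow from the number of items each field must address. A minimal check, with a hypothetical helper name:

```python
def field_width(n: int) -> int:
    """Bits needed to address n distinct items (n >= 1)."""
    return max(1, (n - 1).bit_length())


for name, count in (("port", 4), ("buffer", 32), ("channel", 128)):
    print(f"{name} number: {field_width(count)} bits")
```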

4.4.4.1. Buffer Management Hardware

The  buffer  management  module  must  perform  two  functions: 

(1)  Locate  a  free  buffer  to  hold  a  newly  arriving  packet. 

(2) Locate the next packet waiting to be forwarded on a particular output channel.

Two  implementations  will  be  described  for  performing  these  functions.  The  first 
assumes  that  the  number  of  packets  waiting  to  use  a  given  output  channel  can 
be  larger  than  one.  The  second  restricts  this  number  to  be  at  most  one. 

Since  buffers  are  dynamically  assigned  to  virtual  channels  on  demand,  a 
mechanism  is  required  to  keep  track  of  which  buffers  are  assigned  to  which 
channels  at  any  given  time.  In  the  presented  solution,  this  task  is  accomplished 
by  “chaining”  the  buffers  waiting  to  be  forwarded  on  an  output  channel  into  a 
linked  list  for  that  channel.  When  a  packet  arrives,  it  is  placed  at  the  end  of  the 
linked  list  corresponding  to  the  channel  the  packet  is  to  be  forwarded  on  (read 
from the translation table). It is removed from the list after it has been successfully
transmitted to the next node. The linked lists are managed as a FIFO
queue  to  ensure  that  packets  are  forwarded  in  the  same  order  in  which  they 
arrived.  The  mechanisms  for  managing  the  linked  lists  are  implemented  in 
hardware  so  that  packet  forwarding  can  proceed  as  quickly  as  possible.  A  block 
diagram  of  one  implementation  is  shown  in  figure  4.8a.  In  the  discussion  which 
follows,  it  is  assumed  that  each  p-port  component  has  b  buffers  and  c  input  (or 
output) channels per port, i.e. c×p channels per node.


A  buffer  consists  of  a  16  byte  data  portion,  which  is  physically  distributed 
across  the  memory  modules  in  figure  4.6,  and  a  pointer  word.  The  pointer  word 
indicates the address of the next buffer in this buffer's linked list. The b-word
"link" RAM in figure 4.8a holds these pointers. Each output channel has
pointers to the buffers at the front and end of its linked list. The c×p-word
"front"  and  "rear"  RAMs  in  figure  4.8a  perform  this  function.  Adding  a  new 
buffer  to  an  output  channel  implies  reading  the  rear  RAM  (to  find  the  last  buffer 
in  the  list),  and  writing  the  address  of  the  new  buffer  into  this  address  of  the  link 
RAM  (to  set  the  new  link)  as  well  as  the  rear  RAM  (to  set  the  pointer  to  the  new 
rear  element).  Deleting  an  entry  implies  reading  the  front  RAM  (to  get  the 
address  of  the  buffer  being  deleted),  reading  the  link  RAM  (to  get  the  new  front 
element),  and  writing  this  latter  address  into  the  front  RAM. 

Buffers not linked to any channel list are empty, and are linked together in
a separate "free list". A register, called the "free" register, points to the beginning
of the free list. The arrival of a new packet implies removing an element,
i.e. the address of a free buffer, from the free list, and adding this address to an
output channel’s linked list. Forwarding a packet implies removing the front
element from the channel list, and adding it to the free list. Allowing simultaneous
access to different memories, a buffer can be added to or deleted from a
linked  list  in  four  and  three  clock  cycles  respectively  (where  each  memory 
reference  requires  one  clock  cycle).  The  operations  necessary  to  process  an 
arriving/departing  packet  are  shown  in  figure  4.9  below. 
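The list manipulations described above can also be expressed as a short software model. This is a sketch of the data structure only, not of the hardware timing: the front, rear, and link RAMs are modeled as Python lists, the free register as an attribute, and the class and method names are hypothetical.

```python
NULL = None  # stands in for the hardware NULL pointer


class LinkedListBufferManager:
    """Software model of the front/rear/link RAMs and the free register."""

    def __init__(self, b: int, channels: int) -> None:
        # Initially every buffer is chained into the free list: 0 -> 1 -> ... -> NULL.
        self.link = [i + 1 for i in range(b - 1)] + [NULL]
        self.free = 0 if b > 0 else NULL
        self.front = [NULL] * channels
        self.rear = [NULL] * channels

    def arrive(self, channel: int):
        """Take a buffer off the free list and append it to the channel's list."""
        buffer = self.free
        if buffer is NULL:
            return NULL                    # no more free buffers
        self.free = self.link[buffer]      # new head of the free list
        self.link[buffer] = NULL           # new buffer becomes the last element
        last = self.rear[channel]
        self.rear[channel] = buffer
        if last is NULL:
            self.front[channel] = buffer   # list was empty
        else:
            self.link[last] = buffer       # chain behind the previous last element
        return buffer

    def forward(self, channel: int):
        """Remove the front buffer of the channel's list and return it to the free list."""
        buffer = self.front[channel]
        if buffer is NULL:
            return NULL                    # nothing waiting on this channel
        new_front = self.link[buffer]
        self.front[channel] = new_front
        if new_front is NULL:
            self.rear[channel] = NULL      # list is now empty
        self.link[buffer] = self.free      # push the buffer onto the free list
        self.free = buffer
        return buffer


mgr = LinkedListBufferManager(b=4, channels=8)
first, second = mgr.arrive(3), mgr.arrive(3)
print(mgr.forward(3) == first, mgr.forward(3) == second)  # FIFO order per channel
```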

Packet arrives on channel "ich":

clock cycle  action                        comments

1)           buffer ← free;                address of free buffer
             MARlink ← free;               get ready to read new free list head
             if (free = NULL) abort;       no more free buffers
2)           free ← Link[MARlink];         read new free list head
3)           Link[MARlink] ← NULL;         mark pointer for new buffer
             temp ← Rear[ich];             locate end of linked list
             MARlink ← Rear[ich];          get ready to add to end of list
4)           Rear[ich] ← buffer;           update pointer to end of list
             if (temp = NULL)              if channel list now empty
                 Front[ich] ← buffer       then update front pointer
             else
                 Link[MARlink] ← buffer    else update previous last element

Packet forwarded on channel "och":

clock cycle  action                        comments

1)           buffer ← Front[och];          get address of first buffer in list
             MARlink ← Front[och];
2)           if (buffer = NULL) abort;     abort if list empty
             temp ← Link[MARlink];         address of new front element
3)           Front[och] ← temp;            update front pointer
             Link[MARlink] ← free;         add buffer to free list
             free ← buffer;                new front of free list
             if (temp = NULL)              check if list now empty
                 Rear[och] ← NULL

Figure 4.9. Operations to send and receive packets.

The complexity of the design described above can be reduced significantly
if the restriction is made that any output channel can use at most one buffer at
a time. The impact of this restriction on system performance will be discussed
later. Most of the hardware for managing the linked lists can be eliminated,
since the lists are at most one element long. This allows the three RAMs in figure
4.8a to be combined into one RAM, the "channel-to-buffer" RAM shown in figure
4.8b. This c×p-word RAM maps output channels to buffer addresses. Word i
holds the address of the buffer currently holding a packet for channel i. The list
of free buffers is replaced by a b-bit latch, called the "free buffer latch". The
free buffer latch is implemented as a bit-addressable latch, i.e. a memory device
which is written as a RAM (one bit at a time), but read as a latch (all bits in
parallel). Each bit indicates the status of a buffer: free (1) or in use (0).

When  a  new  packet  arrives,  the  buffer  management  circuitry  must  perform 
two  operations,  assuming  the  flow  control  circuitry  has  first  established  that  the 
packet  can  be  accepted  (discussed  later): 

(1) Find and allocate a free buffer.

(2)  Note  the  location  of  the  buffer  so  that  the  packet  can  be  found  when  it  is 
time  to  forward  it. 

The  address  of  a  free  buffer  is  determined  by  a  priority  encoder  attached  to  the 
free buffer latch. The resulting address is sent to MM0. This address is also used
to clear the corresponding bit in the free buffer latch, effectively allocating the
buffer, and completing the first operation. The second operation is accomplished
by writing the address of the selected buffer into the channel-to-buffer
RAM at the memory location corresponding to the output channel
responsible for forwarding the packet (read from the translation table).

Forwarding  a  packet  on  output  channel  i  also  requires  two  operations: 

(1)  Locate  the  buffer  holding  the  packet  for  channel  i. 

(2) Release the buffer.

The first task is accomplished by reading address i of the channel-to-buffer RAM.
The  resulting  address  is  used  to  set  the  corresponding  bit  in  the  free  buffer 
latch,  marking  the  buffer  free  to  be  used  by  other  packets,  thus  accomplishing 
the  second  task. 

Forwarding  a  packet  requires  the  time  of  two  memory  operations  since  the 
channel-to-buffer  RAM  read  must  be  completed  before  the  latch  write  can  be 
begun. These two steps are easily pipelined, however, allowing a "send packet"
operation to be initiated every clock cycle. The operations for receiving a
packet can be performed in a single clock cycle since both can be executed concurrently.
This is in contrast to the four clock cycles required in the previous
buffer  management  scheme  which  used  linked  lists. 
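The simplified scheme can be modeled with an integer bit mask standing in for the free buffer latch. In this sketch the priority encoder is assumed to select the lowest-numbered free buffer (the text does not say which end has priority), and an empty channel-to-buffer entry is modeled with `None`, whereas the hardware leaves occupancy tracking to the flow control circuitry. All names are hypothetical.

```python
class SingleBufferManager:
    """Model of the channel-to-buffer RAM and the free buffer latch."""

    def __init__(self, b: int, channels: int) -> None:
        self.free_latch = (1 << b) - 1          # bit i = 1 means buffer i is free
        self.chan_to_buf = [None] * channels    # None models "no packet held"

    def _priority_encode(self):
        """Index of the lowest set bit of the latch (assumed priority order)."""
        if self.free_latch == 0:
            return None                         # free buffer pool exhausted
        return (self.free_latch & -self.free_latch).bit_length() - 1

    def arrive(self, channel: int):
        """Allocate a buffer and record it for the forwarding channel."""
        buffer = self._priority_encode()
        if buffer is None or self.chan_to_buf[channel] is not None:
            return None
        self.free_latch &= ~(1 << buffer)       # clear bit: buffer now in use
        self.chan_to_buf[channel] = buffer
        return buffer

    def forward(self, channel: int):
        """Look up the channel's buffer and mark it free again."""
        buffer = self.chan_to_buf[channel]
        if buffer is None:
            return None
        self.free_latch |= 1 << buffer          # set bit: buffer free
        self.chan_to_buf[channel] = None
        return buffer


mgr = SingleBufferManager(b=4, channels=8)
print(mgr.arrive(2), mgr.arrive(5), mgr.arrive(2))
print(mgr.forward(2), bin(mgr.free_latch))
```

Both arrival steps touch different storage elements, which is why the hardware can perform them in the same clock cycle.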

4.4.4.2. Flow Control Hardware: Send/Acknowledge Protocol

In  the  send/acknowledge  flow  control  scheme,  each  node  sends  a  packet, 
and  receives  an  acknowledgement  signal  indicating  whether  the  receiver 
accepted or rejected (i.e. discarded) the packet. Packets may be rejected
because of buffer space limitations or transmission errors in the header, and
must  be  retransmitted  at  some  later  time.  As  discussed  earlier,  it  is  assumed 
that  each  link  uses  a  separate  control  line  to  carry  the  acknowledgement  signal 
back  to  the  sender  with  minimal  delay. 

In  order  to  prevent  buffer  hogging,  each  output  channel  may  not  use  more 
than some "channel limit" of buffers at any one time. In addition, each output
port may not use more than some "port limit" of buffers. Note that the port
and channel limits only restrict the number of buffers the port and channel can
use, and do not represent an a priori allocation of buffer space.

A  block  diagram  of  the  flow  control  circuitry  for  one  port  is  shown  in  figure 
4.10a.  The  circuitry  performs  two  functions: 

(1) It selects a channel which is waiting to use the link and initiates a request
(to the buffer manager) to forward the next packet on this channel.

(2)  It  accepts  or  rejects  arriving  packets. 

Note  that  the  flow  control  circuitry  does  not  deal  with  buffer  numbers.  The 
buffer  manager  keeps  track  of  which  buffers  are  assigned  to  which  channels. 

The  first  function  is  accomplished  by  the  "channel  FIFO"  shown  in  figure 
4.10a. This memory lists channels with packets waiting to be forwarded. When
the link is ready to forward a packet, the first element of the channel FIFO is
removed. The resulting channel number is sent to the buffer management circuitry,
indicating that the next packet on this output channel is to be sent over
the link. If this packet is accepted by the neighboring node, the FIFO element is
discarded. Otherwise, the channel number is reentered at the end of the FIFO.

Figure 4.10. (a) Flow control circuitry; (b) additional circuitry.

The remaining circuitry in figure 4.10a performs the second function: determining
whether an arriving packet should be accepted or rejected. A packet may
be rejected for any of four reasons:

(1)  No  buffers  remain  in  the  node  to  hold  the  packet. 

(2)  The  parity  check  on  the  header  indicates  a  transmission  error. 

(3) The output port is already using the maximum number of buffers allowed.

(4) The output channel is already using the maximum number of buffers allowed.

The flow control hardware must detect each of these cases, and generate a negative
acknowledgement should any of them arise. If none of the conditions arise,
the  packet  is  accepted  and  a  positive  acknowledgement  returned. 

The first condition is detected by a control line from the buffer management
module indicating whether the free buffer pool has been exhausted. A
NULL pointer in the free register in figure 4.8a, or a lack of '1' bits in the free
buffer latch in figure 4.8b, indicates this condition. Similarly, a parity checker in
the input port detects the second condition. A "port counter" is used to detect
the third. This counter indicates the number of additional buffers that port can
use before the port limit is reached. The counter is initially set to the port limit,
decremented each time a packet is accepted, and incremented each time the
port successfully forwards a packet. If the output of the counter is zero, the
port cannot accommodate another packet. The zero-detection circuitry, implemented
by a single NOR gate, identifies this situation.

In  order  to  detect  the  final  condition,  a  channel  using  its  limit  of  buffers, 
circuitry  similar  to  the  port  counter  is  required  for  each  output  channel.  The 
c-word  "channel  RAM"  indicates  the  number  of  buffers  each  channel  can  use 
before reaching its channel limit. Entry i is initially set to the channel limit for
channel i, is decremented each time a packet is accepted by that output channel,
and is incremented when the channel successfully forwards a packet. If a
packet  arrives  and  the  corresponding  channel  RAM  entry  is  zero,  the  packet 
must  be  rejected  and  a  negative  acknowledgement  returned. 

If  the  restriction  is  made  that  each  channel  can  use  at  most  one  buffer  at  a 
time,  then  the  circuitry  in  figure  4.10a  can  be  simplified.  The  channel  RAM  is 
now  one  bit  wide,  and  indicates  whether  or  not  the  channel  has  a  packet  waiting 
to  be  forwarded.  The  increment/decrement  circuit  attached  to  the  channel 
RAM  is  no  longer  needed,  since  accepting  a  packet  implies  setting  a  bit  in  the 
RAM, and forwarding a packet implies resetting a bit. Similarly, the channel
RAM's  zero-detection  circuitry  is  not  required,  since  only  a  single  bit  is  output 
from  the  RAM. 
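The accept/reject decision and the channel FIFO can be summarized in a small model of the circuitry for one output port. The sketch assumes that an accepted packet's output channel is entered into the channel FIFO on arrival, which the text states explicitly only for the remote buffer management scheme; class and method names are hypothetical.

```python
from collections import deque


class SendAckFlowControl:
    """Model of the send/acknowledge flow control circuitry for one output port."""

    def __init__(self, port_limit: int, channel_limit: int, channels: int) -> None:
        self.port_counter = port_limit                 # buffers the port may still use
        self.channel_ram = [channel_limit] * channels  # same, per output channel
        self.channel_fifo = deque()                    # channels with packets waiting

    def on_arrival(self, channel: int, buffers_free: bool, parity_ok: bool) -> bool:
        """Return True for a positive acknowledgement, False for a negative one."""
        if (not buffers_free or not parity_ok
                or self.port_counter == 0 or self.channel_ram[channel] == 0):
            return False
        self.port_counter -= 1
        self.channel_ram[channel] -= 1
        self.channel_fifo.append(channel)              # assumption: queued on accept
        return True

    def next_to_send(self):
        """Channel whose next packet should be sent when the link is ready."""
        return self.channel_fifo.popleft() if self.channel_fifo else None

    def on_ack(self, channel: int, accepted: bool) -> None:
        """Process the neighbor's acknowledgement for a packet sent on channel."""
        if accepted:
            self.port_counter += 1                     # buffer released
            self.channel_ram[channel] += 1
        else:
            self.channel_fifo.append(channel)          # retry later


fc = SendAckFlowControl(port_limit=2, channel_limit=1, channels=4)
print(fc.on_arrival(0, True, True), fc.on_arrival(0, True, True))
```

With a channel limit of one, each `channel_ram` entry only ever holds 0 or 1, which corresponds to the one-bit-wide channel RAM of the simplified design.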

4.4.4.3. Flow Control Hardware: Remote Buffer Management

In  this  scheme,  each  output  port  maintains  enough  information  about 
buffer  allocation  in  its  neighboring  nodes  to  determine  which  channels  may  send 
packets, and which must wait. A channel must wait if it is using "too many" of
its neighbor’s buffers, or if there is insufficient free buffer space to hold a new
packet. Control decisions are made by the sender, and receivers must accept
all packets sent over the link. Eventually the receiver will forward the packet to
a third node. When this happens, the receiver reports back to the sender by
sending a "release channel number" indicating that the buffer is again available
for another packet. The release channel number identifies the channel which originally
sent the packet.

In  addition  to  restricting  the  number  of  buffers  each  channel  can  use  at  one 
time, a "port limit" restricts the number of buffers each output port can use.
Both  the  port  and  channel  limits  are  initialized  by  the  local  routing  controller. 
If  the  port  limit  is  pi,  then  pi  buffers  are  reserved  in  the  neighboring  node  for 
use by this output port. This is in contrast to the send/acknowledge design in
which the port limit only restricts the number the port can use, but does not
actually reserve buffer space. Because no retransmissions are used, the remote
buffer management scheme must reserve buffers to avoid overflows, since an
overflow would result in lost packets.

The  flow  control  circuitry  must  perform  four  functions: 

(1) It selects a channel which is waiting to use the link, and initiates a request
(to  the  buffer  manager)  to  forward  the  next  packet  on  this  channel. 

(2)  It  generates  release  channel  numbers  to  previous  nodes. 

(3)  It  processes  incoming  release  channel  numbers. 

(4) It maintains information about buffer allocation in receiving nodes.

The  physical  circuitry  for  performing  the  first  function  is  virtually  the  same 
as that shown in figure 4.10a for the send/acknowledge scheme; however, the
logical meaning of the information kept in the channel RAM is different. Instead
of indicating the number of local buffers below the channel's limit in the local
node, the RAM indicates the number of remote buffers below the limit in the
neighboring node. As long as entry i is not zero, channel i may send another
packet, assuming the port limit has not been reached. Similarly, the port
counter  also  refers  to  buffers  in  the  neighboring  node  used  by  this  output  port. 

When  a  packet  arrives,  the  number  of  the  output  channel  responsible  for 
forwarding  the  packet  is  added  to  the  end  of  the  channel  FIFO.  Since  all  packets 
are  accepted,  no  further  processing  is  required.  To  forward  a  packet,  the  next 
entry  in  the  channel  FIFO  is  removed.  The  corresponding  channel  number  is 
used  to  address  the  channel  RAM.  If  the  corresponding  entry  of  the  channel  RAM 
is  not  zero,  and  if  the  port  counter  is  not  zero,  then  the  channel  number  is  sent 
to the buffer module and the next packet on this channel is sent over the link.
The port counter and channel RAM entry are then decremented. If either of the
counters was zero, the channel cannot forward the packet, so the channel
number  is  reentered  at  the  end  of  the  channel  FIFO. 

Each  time  a  packet  is  forwarded,  a  release  channel  number  must  be  sent 
back  to  the  neighboring  node  which  sent  the  packet.  The  circuitry  in  figure 
4.10b  performs  this  function.  In  order  to  generate  release  channel  numbers, 
information  must  be  kept  with  each  packet  that  indicates  which  input  port  and 
channel it arrived on. A b-word "release RAM" accomplishes this task. Element
i indicates the input port and channel number the packet in buffer i arrived on.
This information is loaded into the release RAM when a packet arrives and read
when it is forwarded. A small FIFO buffer in each output port holds the channel
number  portion  of  the  word  until  it  can  be  forwarded  to  the  neighbor  which  sent 
the  packet. 

Finally,  when  a  release  channel  number  is  received  from  a  neighboring 
node,  indicating  that  a  certain  channel  is  using  one  fewer  buffer,  the  count  of 
buffers  the  channel  is  allowed  to  use  must  be  incremented.  The  appropriate 
entry of the channel RAM is read, incremented, and written back into the RAM,
completing  the  processing  of  the  release. 
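The sender-side behavior of the remote buffer management scheme can be modeled in the same style as the send/acknowledge circuitry. The sketch below covers forwarding and release processing for one output port; incrementing the port counter on a release is an assumption (the text mentions only the channel RAM entry), and all names are hypothetical.

```python
from collections import deque


class RemoteBufferFlowControl:
    """Model of remote buffer management flow control for one output port."""

    def __init__(self, port_limit: int, channel_limit: int, channels: int) -> None:
        self.port_counter = port_limit                 # remote buffers the port may still use
        self.channel_ram = [channel_limit] * channels  # same, per output channel
        self.channel_fifo = deque()                    # channels with packets waiting

    def on_arrival(self, channel: int) -> None:
        """All arriving packets are accepted; just queue the forwarding channel."""
        self.channel_fifo.append(channel)

    def try_forward(self):
        """Return the channel to send on, or None if the head entry must wait."""
        if not self.channel_fifo:
            return None
        channel = self.channel_fifo.popleft()
        if self.channel_ram[channel] != 0 and self.port_counter != 0:
            self.channel_ram[channel] -= 1
            self.port_counter -= 1
            return channel
        self.channel_fifo.append(channel)              # reenter at the end of the FIFO
        return None

    def on_release(self, channel: int) -> None:
        """Process a release channel number received from the neighbor."""
        self.channel_ram[channel] += 1
        self.port_counter += 1                         # assumption: port count restored too


fc = RemoteBufferFlowControl(port_limit=2, channel_limit=1, channels=4)
for ch in (0, 0, 1):
    fc.on_arrival(ch)
print(fc.try_forward(), fc.try_forward(), fc.try_forward())
```

The second packet on channel 0 stays queued until the neighbor returns a release channel number for that channel.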

The simplifications resulting from constraining each output channel to
using at most one remote buffer at a time are similar to those described in the
send/acknowledge scheme. The channel RAM is again one bit wide. Forwarding
a packet resets a bit in the RAM, effectively disabling the channel. Receiving a
release channel number causes the bit to be set, reenabling the channel.

4.5.  Evaluation  of  Communication  Component  Parameters 

In order to evaluate the amount of circuitry required to implement the
communication component described above, estimates are required of:

(1) the number of I/O ports

(2)  the  number  of  virtual  channels 

(3)  the  number  of  buffers. 

Chapters  2  and  3  examined  the  first  question  in  detail  and  concluded  that  from 3 
to  5  I/O  ports  should  be  used.  The  remaining  two  questions  will  be  discussed 
next. 

4.5.1. Number of Virtual Channels

Because each virtual channel requires a certain amount of overhead circuitry,
the number of channels must be limited. In addition, it is desirable to limit
the number of channels on each link to prevent "overbooking" the link's
bandwidth,  since  this  will  lead  to  long  queueing  delays  on  the  link  and  to  poor 
performance.  On  the  other  hand,  providing  too  few  channels  per  link  will  lead  to 
a  high  failure  rate  in  establishing  virtual  circuits,  deadlock  situations,  and 
underutilization  of  the  link’s  bandwidth.  Thus  the  number  of  virtual  channels 
per  link  must  be  chosen  to  achieve  good  link  utilization  without  incurring  an 
excessive  amount  of  overhead  circuitry. 

First, link utilization will be used to determine the proper number of channels
per link. The overhead issue will be ignored for now. Deadlocks can be broken
with an end-to-end timeout mechanism, as will be discussed later.

Each virtual circuit using a link requires a certain amount of bandwidth.
Since the bandwidth provided by each link is fixed, each link can support a large
number of circuits with low bandwidth requirements, or a small number of
circuits with high bandwidth requirements. If a large number of channels are
provided to accommodate the former case, a number of high bandwidth circuits
may use the link and overbook the available bandwidth. If a small number of
channels are provided, much of the link's bandwidth will be wasted when many
low bandwidth circuits monopolize the available channels.

One  approach  to  resolving  this  dilemma  is  to  provide  enough  channels  to 
accommodate  a  large  number  of  low  bandwidth  circuits,  but  to  also  provide  a 
separate mechanism which prevents overbooking the link's bandwidth. New
circuits may not be established on the link if the link's bandwidth has been fully
booked,  regardless  of  the  number  of  unallocated  channels  remaining.  Later, 
when  existing  circuits  using  the  link  are  torn  down,  new  circuits  could  again  be 
established. 

In order to implement this mechanism, the bandwidth requirements of each
circuit must be estimated. This could be accomplished statically when the
circuit is established (e.g. the operating system may be able to provide this
information based on the type of traffic expected over the circuit), or
dynamically, "on the fly", by measuring traffic on the circuit. Of course, the
latter scheme has the disadvantage that link bandwidth may still be overbooked
since the amount of bandwidth required by the circuit is not known until after
it is established, i.e. a high bandwidth circuit may be established over the
link before it is known that its bandwidth requirements overbook the link.

Both the static and the dynamic schemes could be implemented by associating
a "link bandwidth indicator" with each output link which indicates the
anticipated bandwidth requirements of circuits using the link. When this
bandwidth indicator exceeds some threshold, no more traffic is routed over that
link. In the first scheme, using "hints" from the operating system, a field in the
packet which sets up the virtual circuit could indicate the anticipated bandwidth
requirements of that circuit. The link indicator is increased by the value of this
field when the circuit is set up, and decreased when the circuit is torn down.
The value of this field must be included in the packet tearing down the circuit as
well as the header, since the component does not keep track of the bandwidth
requirements of each channel.
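A minimal sketch of the static scheme, under stated assumptions, is given below. The class name `LinkBandwidthIndicator`, the method names, and the units of the hint and threshold are hypothetical; the text specifies only that the indicator is increased at set-up, decreased at tear-down, and compared against a threshold.

```python
class LinkBandwidthIndicator:
    """Illustrative model of the per-link indicator in the static ("hint") scheme."""

    def __init__(self, threshold):
        self.threshold = threshold   # booking level beyond which the link is closed
        self.booked = 0              # sum of the hints of circuits using the link

    def set_up(self, hint):
        """Try to route a new circuit over the link.

        The circuit is refused once the indicator already exceeds the threshold,
        regardless of how many channels remain unallocated.
        """
        if self.booked > self.threshold:
            return False
        self.booked += hint
        return True

    def tear_down(self, hint):
        """The tear-down packet must repeat the hint, since the component keeps
        no per-channel record of bandwidth requirements."""
        self.booked -= hint


link = LinkBandwidthIndicator(threshold=80)
a = link.set_up(50)    # accepted, indicator is now 50
b = link.set_up(40)    # accepted, indicator is now 90 and exceeds the threshold
c = link.set_up(10)    # refused
link.tear_down(40)     # indicator drops back to 50
d = link.set_up(10)    # accepted again
```

Because the comparison is made before the hint is added, the last circuit admitted can push the indicator past the threshold; a stricter variant would test `self.booked + hint` instead.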

In  the  dynamic  scheme,  the  bandwidth  indicator  could  be  incremented 
each  time  a  packet  is  sent  on  that  link.  Periodically,  the  routing  controller 
examines  the  indicator  to  determine  if  the  link  is  overbooked,  and  then  clears  it. 
Finally,  a  third  alternative  is  to  measure  the  average  queue  length  on  each  link, 
and  declare  the  link  overbooked  if  this  average  exceeds  a  certain  threshold. 
Again  however,  these  two  schemes  do  not  prevent  overbooking  the  link's 
bandwidth,  but  rather  attempt  to  prevent  a  bad  situation  from  becoming  worse. 

If a separate mechanism is used to prevent overloading the link, each
component should ideally provide an unlimited number of channels since this
guarantees  that  it  will  never  needlessly  block  circuits  trying  to  use  a  link  with 
excess  capacity.  This  number  can  be  reduced  however,  if  the  minimum 
bandwidth  requirements  of  any  virtual  circuit  can  be  established.  The  maximum 
number  of  channels  the  link  will  ever  require  can  be  calculated  by  dividing  the 
total  link  bandwidth  by  this  minimum  channel  bandwidth,  as  will  be  derived 
below. 

In  this  context,  two  questions  must  be  considered: 

(1)  How  many  minimum  bandwidth  circuits  can  be  maintained  on  a  fixed 
bandwidth  link? 

(2)  How  much  traffic  corresponds  to  a  minimum  traffic  load? 

The first question lends itself to a precise mathematical analysis, and will be
discussed next. The second is more difficult to resolve since it is application
program dependent. It will be addressed later.

Given an expected traffic load on each circuit, the proper number of channels
can be estimated via a queueing model. The model for the traffic load on
each communication link is shown in figure 4.11. The n virtual channels using
the link are modeled as a single server queue with traffic arriving from n
sources. It will be assumed that packets arrive on each virtual circuit
according to a Poisson process. Since fixed-length packets are used, service
times are deterministic, resulting in an M/G/1 queueing model. From the
Pollaczek-Khinchin mean value formula [Klei75], the average time $W$ each
packet spends in the queue waiting for the link is

$$W = \frac{\rho \bar{x}}{2(1 - \rho)}$$

where $\bar{x}$ is the service time, i.e. the time required to transmit a
packet, and $\rho$ is the link utilization. Assuming the average arrival rate
on each of the n virtual circuits is $\lambda$ messages per second, or
$\lambda m_l$ bits per second, where $m_l$ is the packet length in bits, the
utilization of a link with a bandwidth of $b$ bits per second is
$\rho = n \lambda m_l / b$ and the service time is $\bar{x} = m_l / b$, so that

$$W = \frac{n \lambda m_l^2}{2 b (b - n \lambda m_l)}$$

Figure 4.11. Queueing model for analyzing number of channels.

To determine numerical estimates of the number of channels, consider the
number of virtual circuits required to drive the link utilization $\rho$ to 1,
which in turn drives the average waiting time to infinity:

$$n = \frac{b}{\lambda m_l}$$

Figure 4.12 shows a plot of this quantity for an 80 Mbit/second link (a byte wide
link running at 10 MHz) as a function of $1/\lambda$, the mean time between
packets on each circuit. This plot also assumes that packets are 17 bytes in
length.
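The queueing relations above can be checked numerically with a short sketch. The function and constant names are hypothetical. The formulas assume Poisson arrivals and fixed-length packets as in the text, with utilization n·λ·m/b and mean wait ρ·x/(2(1 − ρ)), where m is the packet length in bits, b the link bandwidth in bits per second, and x = m/b the service time.

```python
def link_utilization(n, lam, m_bits, b):
    """Utilization of a link of bandwidth b shared by n circuits."""
    return n * lam * m_bits / b


def mean_wait(n, lam, m_bits, b):
    """Mean time a packet waits for the link (Pollaczek-Khinchin, fixed service time)."""
    rho = link_utilization(n, lam, m_bits, b)
    if rho >= 1.0:
        return float("inf")          # the queue grows without bound
    service_time = m_bits / b        # time to transmit one packet
    return rho * service_time / (2.0 * (1.0 - rho))


def max_channels(interarrival_s, m_bits, b):
    """Number of circuits that drives the utilization to 1."""
    return b * interarrival_s / m_bits


PACKET_BITS = 17 * 8     # 17-byte packets
LINK_BPS = 80e6          # 80 Mbit/second link

# A circuit sending one packet every 100.6 microseconds on average:
print(int(max_channels(100.6e-6, PACKET_BITS, LINK_BPS)))  # prints 59
```

The value 59 agrees with the channel count quoted later for a mean interarrival time of 100.6 microseconds.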

The critical parameter in evaluating the number of channels is the
expected traffic load on each virtual circuit. Unfortunately, the bandwidth
requirements of each circuit may be arbitrarily small, implying an arbitrarily
large number of channels should be supported. In reality however, seldom
used virtual circuits may be torn down and reestablished as necessary to reduce
the number of channels required. Since reestablishing a circuit incurs
considerably more delay than sending a message on an existing circuit, these
lightly loaded circuits must not have low latency requirements.

Figure 4.12. Maximum number of channels vs. channel loading.

In order to determine numerical estimates of virtual circuit loading, the
average time between messages on virtual circuits was measured for the five
application programs described in chapter 3 (the artificial traffic load program
is excluded) assuming negligible communication delays. These arrival rates are
shown in table 4.1 below, and represent rates averaged over all virtual circuits in
the application program weighted according to the number of messages sent on
each circuit. If $n_i$ and $\lambda_i$ are respectively the number of messages
sent and the average arrival rate on virtual circuit i, then the overall average
arrival rate is computed as:

$$\bar{\lambda} = \frac{\sum_i n_i \lambda_i}{\sum_i n_i}$$

Standard deviations for the interarrival times are also shown in table 4.1.
The zero value in the FFT program is due to the regularity of its structure: all
tasks iteratively perform the same computation, so the time between messages
is always the same. The other values reflect the fact that the programs typically
use two types of circuits: those with little or no computation between messages,
e.g. the circuits distributing the initial data samples in the signal processing
programs, and those with more significant computations between messages.
These latter circuits have an average interarrival time which is much larger
than the former.

Table 4.1. Average arrival rates per virtual circuit.

Program           Interarrival time   Standard deviation   Number of channels
                  (microseconds)      (microseconds)       (80 Mbit link)
Barnwell          100.6               40.4                 59
FFT               58.0                0.0                  34
(not recovered)   34.2                43.8                 20
(not recovered)   22.1                23.2                 13
(not recovered)   11.7                9.2                  (not recovered)

Figure 4.13 shows the average waiting time to use an 80 Mbit/second link
loaded with virtual circuits which each carry an artificial traffic load
corresponding to the average values listed in table 4.1. The curves show that up
to 59 channels should be allowed for the program exhibiting the lightest traffic
load, the Barnwell signal processing program. This value is also listed in table
4.1 along with the corresponding values for the other application programs.
Thus, for workloads similar to those discussed here, it would be reasonable to
provide up to 64 channels on each link. If one considers the standard deviation
on the Barnwell program, one could argue that this figure should be raised to
128, since it is possible that most of the circuits using a link could by chance fall
below the average arrival rate.

As discussed earlier, the tasks in the application programs listed in table
4.1 communicate relatively frequently. Programs requiring less frequent
communication use circuits that are more lightly loaded, implying each link could
support even more channels. Thus, the figure derived above should be
considered a lower bound rather than an absolute estimate of the number of
channels. Since there may be any number of circuits which communicate
infrequently, but which require low latency (making it unreasonable to tear down
and reestablish the circuit each time a message is sent), more than 64 channels
should be provided. However, increasing the number of channels increases the
size of the channel number that must precede each packet, reducing the
amount of bandwidth available for transmitting data. A compromise between
these conflicting considerations is to provide 128 or 256 channels per link, since
this allows more than 64 channels, but still confines the channel number
overhead to a single byte.

Figure 4.13. Waiting time vs. number of channels for various programs.
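The single-byte limit follows from the width of the channel number field. The arithmetic can be illustrated with the sketch below, in which the function names are hypothetical and the channel number is assumed to be a plain binary field.

```python
def channel_number_bits(num_channels):
    """Bits needed to distinguish num_channels channels with a binary field."""
    return max(1, (num_channels - 1).bit_length())


def channel_number_bytes(num_channels):
    """Whole bytes needed to carry the channel number ahead of each packet."""
    return (channel_number_bits(num_channels) + 7) // 8


for channels in (64, 128, 256, 512):
    print(channels, channel_number_bits(channels), channel_number_bytes(channels))
```

Under this assumption 64, 128 and 256 channels need 6, 7 and 8 bits respectively and all fit in one byte, while 512 channels would need 9 bits and therefore a second byte.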

Finally,  let  us  consider  the  impact  future  improvements  in  technology  will 
have  on  the  number  of  channels.  Both  computing  speeds  and  link  bandwidths 
can be expected to improve. Higher bandwidths imply that each link could
support more channels. On the other hand, higher computation rates lead to higher
traffic  loads,  implying  fewer  channels  are  necessary.  If  switching  speeds 
increase  at  a  faster  rate  than  communication  rates,  then  we  can  expect  virtual 
circuit  loading  to  be  the  more  dominant  factor,  implying  fewer  channels  per 
link.  On  the  other  hand,  if  communication  rates  progress  at  a  higher  pace,  then 
more  channels  may  be  provided.  While  switching  speeds  can  be  expected  to 
improve  by  an  order  of  magnitude  over  the  next  20  years  [Keye79],  fiber  optic 
links  may  lead  to  much  larger  improvements,  implying  more  channels  may  be 
supported. 

4.5.2. Amount of Buffer Space

Technological capabilities limit the amount of buffer space that can be
provided by each communication component. On the other hand, insufficient buffer
space  will  lead  to  performance  degradations,  since  communication  bandwidth  is 
wasted  and  delays  increased  if  buffers  are  not  available  to  hold  arriving  packets. 

In  extreme  cases,  buffer  deadlock  will  result.  Buffer  deadlock  is  a  situation 
in  which  message  traffic  comes  to  a  complete  halt  because  a  set  of  nodes  have 
exhausted  all  of  their  available  buffer  space.  Each  node  cannot  forward  a  packet 
because  no  buffer  is  available  to  receive  it,  and  each  node  cannot  free  a  buffer 
because  no  packets  can  be  forwarded.  An  example  of  such  a  deadlock  situation 
is  shown  in  figure  4.14,  where  each  node  has  a  single  buffer  holding  a  packet 
waiting  to  be  forwarded.  The  network  will  remain  deadlocked  until  a  packet  is 
discarded,  releasing  a  buffer. 


Figure 4.14. Example of buffer deadlock.

Thus, sufficient buffer space must be provided to:

(1) avoid buffer deadlock.

(2) ensure good performance.

Each of these issues will now be discussed in turn, followed by results from
simulation studies that help to determine the buffering requirements of each
communication component.

4.5.2.1. Buffer Space: Deadlock Considerations


Buffer deadlock can be prevented if enough buffer space is provided in each
communication component. A brute force solution is to provide each virtual
channel with its own buffer. Since each circuit is allocated a buffer in each
node it passes through, traffic on a circuit cannot be blocked by traffic on other
circuits, so buffer deadlock cannot occur. Providing a separate buffer on each
channel is wasteful however, since each component will have to provide as many
buffers as there are channels. It will be seen that virtually the same
performance can be achieved if many channels share a much smaller pool of buffers.

An alternative approach to avoiding buffer deadlock is outlined in [Merl80a,
Merl80b]. Here, each node must have at least $H_{max}$ buffers, where $H_{max}$
is the maximum number of hops traversed by any virtual circuit. The buffer pool
in each node is partitioned into $H_{max}$ disjoint pools or levels, say
1, 2, ..., $H_{max}$. Each node maintains a "hop count" for each circuit passing
through the node indicating the number of hops the circuit has traversed from
the source node to the current node. The hop count is set when the virtual
circuit is first established. A packet arriving at a node on a circuit with hop
count i may only be placed in one of the buffers in level i. It can be shown
that buffer deadlock will never occur in this scheme.

The central disadvantage of this scheme is that large networks require
more  buffers  than  smaller  ones,  so  the  communication  component  must  provide 
enough  buffers  to  accommodate  the  largest  network  it  will  ever  become  a  part 
of.  This  may  require  an  excessively  large  amount  of  buffer  space.  In  addition,  if 
traffic  is  highly  localized,  many  buffers  are  wasted  because  those  reserved  for 
higher  hop  counts  are  never  utilized. 

Partitioning the buffer pool also adds a certain amount of complexity to the
component. The partitioning can be accomplished by limiting the total number
of buffers used by circuits at the same level, rather than physically partitioning
the buffer space. For example, if $H_{max}$ is 10 and there are 20 buffers in the
buffer pool, it is sufficient to use the buffer management schemes described
above, and ensure that the circuits at any given level collectively use no more
than 2 buffers at one time. This could be implemented with a counter at each
level indicating the number of free buffers currently available at that level. A
packet is rejected if no more buffers are available at its particular level.
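The per-level counters can be sketched as follows. The class and method names are hypothetical, and the sketch assumes the buffers are divided evenly among the levels, as in the example of 20 buffers and a maximum hop count of 10.

```python
class LevelBufferPool:
    """Illustrative model of a buffer pool logically partitioned by hop count."""

    def __init__(self, h_max, total_buffers):
        per_level = total_buffers // h_max      # assumed even split across levels
        # One counter per level: free buffers currently available at that level.
        self.free = {level: per_level for level in range(1, h_max + 1)}

    def accept(self, hop_count):
        """Accept a packet on a circuit with the given hop count, if possible."""
        if self.free.get(hop_count, 0) == 0:
            return False                         # packet is rejected
        self.free[hop_count] -= 1
        return True

    def release(self, hop_count):
        """Free a buffer once the packet at this level has been forwarded."""
        self.free[hop_count] += 1


pool = LevelBufferPool(h_max=10, total_buffers=20)
results = [pool.accept(3), pool.accept(3), pool.accept(3), pool.accept(4)]
pool.release(3)
retry = pool.accept(3)
```

Here the third packet at level 3 is rejected even though 18 buffers remain free elsewhere in the pool, which illustrates the waste noted above when traffic is concentrated at a few hop counts.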

Implementation of this scheme with a remote buffer management policy for
flow control is more difficult, since different nodes compete for the buffers in
each level. The buffers in each level must be further partitioned among the
neighboring nodes to avoid overflow within each level.

A third approach to resolving the deadlock issue is to allow deadlock to
occur, but to incorporate a mechanism that ensures that deadlocks are broken.
Since detecting and breaking a deadlock may be a time consuming operation,
enough buffer space should be provided to ensure that deadlocks occur
infrequently. The deadlock breaking mechanism could be implemented as a side
effect of an end-to-end protocol using timeout counters to retransmit lost
messages. In such a scheme, each message sent over a virtual circuit must be
acknowledged by the receiver. If an acknowledgement is not returned after a
certain amount of time, the sender assumes that the message was lost and must be
retransmitted. In order to avoid duplicate packets, the sender must first
"clear" the virtual circuit by sending a special packet which flows through the
circuit and destroys all packets it encounters. It then resends the lost message.
If deadlock occurs, timeouts will result and circuits will be cleared. This
releases buffer space and breaks the deadlock. Since such a timeout mechanism
is already required to retransmit lost packets, the deadlock breaking
mechanism incurs virtually no additional cost. In this scheme, communication
components are not required to ensure that buffer deadlock never occurs, so
they need not provide excessive amounts of buffer space. Thus this mechanism
appears to be an attractive one for solving the buffer deadlock problem.

It  might  be  noted  that  a  similar  mechanism  could  be  used  to  break 
deadlocks  arising  when  links  use  all  of  their  virtual  channels.  Expiration  of  an 
end-to-end  timeout  during  the  set-up  of  a  virtual  circuit  could  trigger  the 
release  of  a  special  packet  which  destroys  the  partially  completed  circuit, 
releasing  channels,  and  breaking  the  deadlock.  This  protocol  also  requires  an 
end-to-end  acknowledgement  to  mark  the  establishment  of  each  virtual  circuit. 

Simulation experiments similar to those discussed in chapter 3 were carried
out to evaluate the number of buffers each communication component
should provide to avoid buffer deadlock. The six application programs discussed
in chapter 3 were executed on Simon with a switch model for a hexagonal lattice
network built from 4-port communication components (figure 3.14a).

The first set of experiments assumed that each component provides b
buffers, and that no restrictions are made on buffer sharing, i.e. virtual circuits
may use as many buffers as are available. A large number of buffers, over 100
for some of the programs, was required in each component to avoid buffer
deadlock.


Figure  4.15.  Example  of  congestion  leading  to  deadlock. 

Buffer hogging was at the root of these deadlock problems. Consider the
situation in figure 4.15. Virtual circuits 1 and 2 join at node A, sharing the link
from A to B, and virtual circuit 3 uses the link from B to A. Suppose all three
circuits carry a steady stream of packets, or equivalently, suppose a burst of
packets simultaneously arrives on each circuit. Since the flow of packets into
node A on circuits 1 and 2 exceeds the flow from A to B (the latter is limited by
the capacity of the link from A to B), a queue begins to form at node A. The
queue will grow until the free buffer pool in node A is exhausted. When this
happens, traffic on circuit 3 is blocked, and a queue of packets begins to grow in
node B. Eventually, B's free buffer pool will also be exhausted, blocking traffic
on circuits 1 and 2. The network is now deadlocked, and will remain in this state
until packets are discarded.

The  scenario  described  above  can  be  avoided  if  precautions  are  taken  to 
avoid  buffer  hogging,  e.g.  by  restricting  the  number  of  buffers  each  circuit  can 
use. The simulation experiments were repeated assuming each circuit could not
use  more  than  one  buffer  in  each  node  at  one  time.  It  was  found  that  12  buffers 
were  sufficient  to  avoid  buffer  deadlock  in  all  six  application  programs. 

The  simulation  experiments  thus  demonstrate  the  need  to  provide  a  flow 
control  mechanism  which  prevents  buffer  hogging.  The  studies  also  indicate 
that it is reasonable to provide each component with a few tens of buffers, say
32, to reduce the probability of buffer deadlock.

4.5.2.2.  Buffer  Space:  Performance  Considerations 

Each communication component must provide enough buffer space to
maintain a steady flow of traffic through the node. Otherwise, communication
bandwidth will be wasted: in the send/acknowledge flow control scheme,
retransmissions are required to resend rejected packets, while in the remote
buffer management scheme, links simply become idle. How many buffers are
required to maintain this flow? Studies of multistage permutation networks,
called delta networks, indicate that virtually no performance improvement
arises beyond three buffers per node [Dias81a, Dias81b]. Three buffers are not
sufficient however, to avoid many deadlock situations. Thus, based on these
studies, deadlock rather than performance optimization should be used to
determine the amount of buffer space required.

Simulation experiments were carried out to evaluate the following
questions:

(1) How many buffers should each component provide to achieve good
performance?

(2) How many buffers should each virtual circuit be allowed to use at one time?

(3) How well does the simple send/acknowledge flow control mechanism
perform?


Before  discussing  the  results  of  these  simulation  experiments,  let  us  anticipate 
the  answers  to  these  questions  by  deriving  an  intuitive  understanding  of  the 
impact  of  buffering  and  flow  control  on  network  performance. 

Buffering and flow control questions are of little consequence when the
network is lightly loaded, since buffering requirements are low (buffers start to
empty while they are being filled) and throttling mechanisms are not necessary.
Therefore, we will only consider the case in which links are congested. An
example of a congested link is shown in figure 4.16a. Two circuits, 1 and 2,
arrive at a node on links A and B respectively, and share link C. Assume that
each circuit carries a continuous stream of packets. Since the combined
bandwidth of links A and B is twice that of C, the latter becomes the bottleneck
which limits the performance (i.e. bandwidth) of each circuit.

First, let us consider the simplest communication component design in
which a send/acknowledge protocol is used for flow control, and where each
channel is allowed to use only a single buffer at one time. Figure 4.16b indicates
the utilization of the communication links over time. The contents of the buffer
held by each circuit are also shown. Successive packets on virtual circuits 1 and 2
are labelled 1a, 1b, 1c, ... and 2a, 2b, 2c, ... respectively. As shown in figure
4.16b, the congested link, C, is fully utilized, and carries packets from both
virtual circuits. The end-to-end bandwidth of the two virtual circuits will be equal
to half the bandwidth of the C link, assuming traffic in other nodes does not limit
performance. Note that most of the packets sent over links A and B must be
retransmitted, since the first attempt fails because of the "one buffer per
channel" restriction. Yet, these retransmissions do not waste bandwidth on the
bottleneck link, C, so the end-to-end bandwidth is not affected. However, the
negatively acknowledged packets do reduce the effective bandwidth of other
circuits which do not use link C, but which are instead limited by the bandwidth of
links A or B. This will be discussed later.
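The behavior described for figure 4.16b can be reproduced with a small slotted-time sketch. It is an illustrative simplification, not the simulator used in this work. It makes several assumptions: all links have equal bandwidth, one packet is transmitted per slot, each source always has a packet to send, a packet is accepted only if its channel's single buffer was empty at the start of the slot, and link C serves full buffers in round-robin order. The function name is hypothetical.

```python
def simulate_bottleneck(slots=1000, channels=2):
    """Two input links feed one output link; one buffer per channel, send/acknowledge."""
    buffer_full = [False] * channels   # the single buffer held by each channel
    delivered = [0] * channels         # packets forwarded over link C per channel
    naks = [0] * channels              # negatively acknowledged packets per channel
    busy_slots = 0                     # slots in which link C carried a packet
    last = channels - 1                # channel served most recently on link C

    for _ in range(slots):
        full_at_start = list(buffer_full)

        # Output link C: forward one buffered packet, chosen round robin.
        for step in range(1, channels + 1):
            ch = (last + step) % channels
            if full_at_start[ch]:
                buffer_full[ch] = False
                delivered[ch] += 1
                busy_slots += 1
                last = ch
                break

        # Input links: every source transmits in every slot.
        for ch in range(channels):
            if full_at_start[ch]:
                naks[ch] += 1              # buffer occupied: reject, retransmit later
            else:
                buffer_full[ch] = True     # buffer free: accept the packet

    return busy_slots / slots, delivered, naks


utilization, delivered, naks = simulate_bottleneck()
print(utilization, delivered, naks)
```

In this model link C is idle only in the first slot, each circuit receives about half of its bandwidth, and roughly every other transmission on the input links is negatively acknowledged.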

In figure 4.16c, the send/acknowledge protocol is replaced by a remote
buffer management scheme, and circuits are still restricted to using a single
buffer at one time. The optimistic assumption is made that each node has
"perfect information" about buffer utilization in its neighboring nodes, i.e. the
time required to transmit the feedback signal indicating a packet has been
forwarded (and consequently, a buffer has been released) is assumed to be
negligible. The flow of traffic over the bottleneck link, C, is exactly the same
as that when the simpler send/acknowledge protocol is used, so the end-to-end
bandwidth of the two virtual circuits remains the same. In comparing figures
4.16b and 4.16c, it is seen that the negatively acknowledged packets on links A
and B in figure 4.16b are replaced by idle periods in figure 4.16c.

Finally,  in  figure  4.16d,  an  unlimited  amount  of  buffer  space  is  provided  in 
each  node.  No  flow  control  mechanism  is  required  since  there  are  no  buffer 
overflows,  and  therefore  no  reason  to  throttle  traffic.  Utilization  of  the 
bottleneck  link  is  the  same  as  that  of  the  previous  two  cases. 

Intuitively,  buffering  is  provided  in  each  node  to  achieve  higher  network 
bandwidth  by  maintaining  a  large  enough  "backlog"  of  traffic  in  the  node  so  that 
its  output  links  remain  busy  under  heavy  traffic  loads.  If  a  node  provides 
enough  buffers  to  keep  its  links  busy,  then  additional  buffers  do  not  improve 
performance.  As  demonstrated  in  figures  4.16b-d,  a  relatively  small  amount  of 
buffer  space  per  node  is  required  to  perform  this  function,  explaining  the  results 
observed in [Dias81a, Dias81b]. Protection against buffer hogging must be
provided however, e.g. by limiting the number of buffers each port can use, to
ensure that one link does not monopolize the buffer pool and cause other link(s)
to become idle.


Figures 4.16b-d indicate that the one buffer per channel restriction does
not have a significant impact on network performance. Performance is limited
by the bandwidth of bottleneck links, rather than a lack of buffer space. If links
are underutilized, then each channel need only provide a single buffer to
maintain a steady flow of traffic, as discussed earlier. If links are congested, then
there  is  enough  traffic  on  other  circuits  to  keep  the  links  busy,  so  no  bandwidth 
is  wasted.  Performance  is  not  improved  by  allowing  channels  to  use  additional 
buffers. 

These studies also indicate that the send/acknowledge protocol does not
adversely affect performance on bottleneck links (link C in figure 4.16b). This
flow control mechanism does, however, require retransmissions on the links
leading up to the bottleneck, implying some wasted bandwidth. Often, however, it is
the bottleneck link which limits performance, and not the links leading up to the
bottleneck, so this wasted bandwidth is of secondary importance. In addition,
this waste is only of consequence if there is other traffic waiting to use the link.
If other traffic exists, then the amount of wasted bandwidth is reduced, since
many of the "negative acknowledgment slots" in figure 4.16b will be replaced by
traffic  on  other  virtual  circuits,  assuming  the  blocked  circuit  is  not  allowed  to 
dominate  use  of  the  link.  Thus,  a  more  sophisticated  flow  control  mechanism 
which  avoids  negatively  acknowledged  packets,  e.g.  the  remote  buffer  manage¬ 
ment  scheme  described  earlier,  leads  to  an  even  smaller  improvement  in  perfor¬ 
mance.  Since  network  bandwidth  can  be  improved  at  a  higher  level  by  using 
more  switching  chips,  the  additional  complexity  of  the  remote  buffer  manage¬ 
ment  scheme  is  difficult  to  justify.  In  addition,  output  port  buffer  hogging  is 
more  difficult  to  prevent  with  this  more  sophisticated  scheme  (section  4.1.3),  so 
performance  may  actually  be  reduced  if  this  scheme  is  used. 
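
The send/acknowledge behavior with one buffer per channel can be sketched in C as follows. This is only an illustrative model, not the circuitry of the component: the names channel_t, receive_packet and forward_packet, and the channel count, are assumptions made for the example.

```c
#include <stdbool.h>

#define NUM_CHANNELS 64          /* assumed channel count per link */

/* Hypothetical per-channel state: one buffer, either empty or full. */
typedef struct {
    bool full;                   /* true while a packet occupies the buffer */
} channel_t;

static channel_t channels[NUM_CHANNELS];

/* Receiver side: accept the packet only if the channel's single buffer
 * is free. Returns true for a positive acknowledgment and false for a
 * negative one, in which case the sender must retransmit later. */
bool receive_packet(int chan)
{
    if (channels[chan].full)
        return false;            /* negative acknowledgment */
    channels[chan].full = true;
    return true;                 /* positive acknowledgment */
}

/* Called once the buffered packet has been forwarded on the next link. */
void forward_packet(int chan)
{
    channels[chan].full = false;
}
```

A rejected packet costs one transmission slot on the incoming link, which is the wasted bandwidth discussed above; no state beyond a single flag per channel is needed.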


Let us now examine the simulation results to see if they are consistent with
the discussion presented above. Figures 4.17a-f show the performance resulting
from executing the six application programs discussed in chapter 3 on a
hexagonal lattice network built from 4-port communication components. All curves use
a send/acknowledge protocol for flow control. If several packets are queued,
waiting to use the same link, a round-robin algorithm is used to select the next
packet to be sent over the link. This prevents a blocked virtual circuit from
dominating use of the link by continuously retransmitting negatively
acknowledged packets.
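
A minimal C sketch of such a round-robin selection is given below. It assumes a small fixed number of virtual circuits sharing one link; the array pending, the variable last_served and the function next_circuit are hypothetical names introduced for illustration.

```c
#define NUM_CIRCUITS 8           /* assumed number of circuits sharing the link */

/* Hypothetical state: pending[i] counts packets queued on circuit i. */
static int pending[NUM_CIRCUITS];
static int last_served = NUM_CIRCUITS - 1;

/* Pick the first circuit after the one served last that has a packet
 * waiting. Returns the circuit index, or -1 if nothing is queued.
 * A circuit whose packet was just rejected moves to the back of the
 * rotation, so it cannot monopolize the link with retransmissions. */
int next_circuit(void)
{
    int i;

    for (i = 1; i <= NUM_CIRCUITS; i++) {
        int c = (last_served + i) % NUM_CIRCUITS;
        if (pending[c] > 0) {
            last_served = c;
            return c;
        }
    }
    return -1;
}
```

The caller is expected to decrement pending[c] only when the packet is positively acknowledged; a negatively acknowledged packet simply stays queued until its circuit comes up again.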

The  curves  in  figure  4.17  are  distinguished  by  the  amount  of  buffer  space 
provided  in  each  communication  component,  and  the  degree  to  which  buffer 
usage  is  restricted.  Four  combinations  of  these  two  parameters  result: 

(1)  Unlimited  buffer  space  and  no  restrictions  on  buffer  usage. 

(2)  Unlimited  buffer  space  but  with  restrictions  on  buffer  usage. 

(3)  Limited  buffer  space  and  no  restrictions  on  buffer  usage. 

(4)  Limited  buffer  space  but  with  restrictions  on  buffer  usage. 

As  discussed  earlier,  the  third  class  of  networks,  those  with  a  limited  amount  of 
buffer  space  and  no  restrictions  on  buffer  usage,  resulted  in  deadlock  situations 
for  many  of  the  application  programs.  Thus,  performance  of  the  application 
programs on these networks is not shown in figure 4.17.

All networks with limited amounts of buffer space provide 16 buffers per
communication component. These networks also assume that each output port
may not use more than 8 buffers at one time, or 16/√4, as suggested in [Irla78].
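
The restriction on buffer usage can be illustrated with a short C sketch. It assumes a shared pool of 16 buffers, 4 output ports, and a limit of 8 buffers per port, as stated above; the function and variable names are hypothetical.

```c
#include <stdbool.h>

#define NUM_PORTS   4            /* output ports per component */
#define POOL_SIZE   16           /* buffers per component */
#define PORT_LIMIT  8            /* most buffers one output port may hold */

static int used_by_port[NUM_PORTS];
static int total_used;

/* Try to reserve a buffer for a packet destined for the given output
 * port. Fails if the pool is exhausted or the port is at its limit. */
bool alloc_buffer(int port)
{
    if (total_used >= POOL_SIZE || used_by_port[port] >= PORT_LIMIT)
        return false;
    used_by_port[port]++;
    total_used++;
    return true;
}

/* Release a buffer once its packet has left through the given port. */
void free_buffer(int port)
{
    used_by_port[port]--;
    total_used--;
}
```

With this limit a single congested port can hold at most half of the pool, so the remaining buffers stay available to keep the other links busy.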

The curves indicate that networks with 16 buffers per component yield
virtually the same performance as networks with an infinite amount of buffer
space, in agreement with the discussion presented earlier. Restricting virtual
circuits to using at most one buffer at a time results in no significant
degradation in performance. Networks using a send/acknowledge protocol for flow
control yield virtually the same performance as networks with an infinite amount of
buffer space and no restrictions on buffer usage. This indicates that the
bandwidth wasted by negatively acknowledged packets does not have a
significant effect on performance. This is due in part to the round-robin
algorithm used for scheduling usage of the communication links. Blocked virtual
circuits relinquish use of the link when a packet is rejected, allowing other
traffic to use the link. Thus, the simulation results agree with the intuitive
arguments presented earlier.

It  might  be  noted  that  many  of  the  programs  achieve  identical  speedups 
regardless  of  the  amount  of  buffering  provided,  or  the  buffer  restrictions 
enforced.  In  these  programs,  a  single,  or  a  few  isolated  bottleneck  links  limit 
performance, e.g. the SISO programs are limited by the links carrying the initial
data samples. This is in agreement with the scenarios outlined in figures 4.16b-d,
where identical utilizations are achieved on bottleneck links regardless of the
buffering scheme used.

All of the application programs send small, single-packet messages. To
examine buffering requirements when large messages are used, the artificial
traffic load program was modified to send longer messages consisting of 256
bytes, or 16 packets each. Figure 4.17g shows the performance of this program
under different buffering schemes. The curves indicate that message delays are
the same under light traffic loads, demonstrating that one buffer per virtual
circuit is adequate to maintain a steady stream of packets if there is no interfering
traffic. The curves also indicate, however, that networks achieve somewhat
higher bandwidth if multiple buffers per virtual circuit are allowed. As buffering
is increased, the number of negatively acknowledged packets on circuits leading
up to congested links is reduced, and bandwidth improves. This additional
performance must be weighed against the added complexity of allowing multiple
buffers per virtual circuit. In light of the fact that network bandwidth can be
improved by increasing the number of communication chips, it is doubtful that
this additional improvement is justifiable. In addition, the extra complexity
of allowing a circuit to use more than one buffer (a FIFO queue on each channel is
required) may lead to longer circuit delays and slower clock rates, reducing
performance.

4.6.  Complexity  of  the  Communication  Component 

Using  the  results  presented  above,  the  complexity  of  the  VLSI  communica¬ 
tion  components  described  here  can  be  estimated.  It  is  assumed  that  each 
component provides from 64 to 256 channels per link, 32 buffers, each large
enough to accommodate 16 bytes of data, and 4 I/O ports. Transistor counts for
different  versions  of  communication  components  providing  different  levels  of 
functionality  are  presented. 

The communication component design described here consists of 6
modules:

(1)  I/O ports,

(2)  routing  controller, 

(3)  translation  tables, 

(4)  packet  buffers, 

(5)  buffer  management  circuitry, 

(6)  flow  control  circuitry. 

Each  of  these  will  be  discussed  in  turn.  Estimates  of  the  amount  of  circuitry 
required  for  these  modules  are  summarized  in  table  4.2. 

Approximate  gate  counts  are  derived  in  part  from  the  designs  described  in 
the  TTL  Databook  [Inst76].  For  example,  the  five  2  line  to  1  line  multiplexers 
shown  in  figure  4.10b  are  assumed  to  require  18  gates,  based  on  an  extension  of 
the  corresponding  TTL  part,  the  SN74157  [Inst76].  Similarly,  registers  are 
assumed  to  use  5  gates  per  bit.  Data  paths  for  the  high  speed  buses  and  com¬ 
munication  links  are  8  bits  wide.  It  is  assumed  that  each  finite  state  machine 
requires  200  gates.  This  is  based  on  the  average  number  of  gates  required  for 
the  finite  state  machines  described  in  [Fuji80],  which  are  of  roughly  the  same 
complexity  as  those  described  here.  Finally,  the  transistor  counts  assume  four 
transistors  are  required  for  each  gate.  The  estimates  in  table  4.2  are  rounded 
to  the  nearest  1000  transistors. 
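
The estimating rules stated above amount to simple arithmetic, sketched below in C. The constants are those given in the text (5 gates per register bit, 200 gates per finite state machine, 4 transistors per gate); the function names are hypothetical.

```c
#define GATES_PER_REGISTER_BIT 5     /* from the text */
#define GATES_PER_FSM          200   /* from the text */
#define TRANSISTORS_PER_GATE   4     /* from the text */

/* Transistors for a register of the given width in bits. */
long register_transistors(int bits)
{
    return (long)bits * GATES_PER_REGISTER_BIT * TRANSISTORS_PER_GATE;
}

/* Transistors for n finite state machines. */
long fsm_transistors(int n)
{
    return (long)n * GATES_PER_FSM * TRANSISTORS_PER_GATE;
}

/* Round to the nearest 1000 transistors, as done in table 4.2. */
long round_to_thousand(long t)
{
    return ((t + 500) / 1000) * 1000;
}
```

For instance, an 8-bit register matching the width of the data paths comes to 40 gates or 160 transistors, and one finite state machine to 800 transistors.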

Estimates for the I/O ports are based on the circuitry described in
[Laur79]. The output port estimate includes circuitry for removing data from
the high speed bus and driving the communication links. The input port
estimate includes circuitry for driving the high speed bus, as well as three
temporary registers to handle conflicts in accessing the shared memory modules, as
discussed earlier. Based on this, each I/O port requires roughly 2000
transistors, or 8000 transistors for 4 ports.

Table 4.2
Transistor Counts

                                            memory (kbits)
                                        64        128       256
module                   transistors  channels  channels  channels

I/O ports                    8000        -         -         -
Routing controller
  simple                     9000       0.6       1.1       2.1
  complex                   10000       1.4       1.9       2.9
Routing processor           20000      32.0      32.0      32.0
Translation table               -       2.2       5.0      11.0
Buffers (16 modules)         8000       4.0       4.0       4.0
Buffer management
  ≤1 buffer/vc               2000       1.3       2.5       5.0
  ≤4 buffers/vc              2000       2.7       5.2      10.2
Flow control
  Send/Acknowledge
    ≤1 buffer/vc            11000       0.6       0.9       1.5
    ≤4 buffers/vc           12000       0.9       1.4       2.5
  Remote buffer
    ≤1 buffer/vc            13000       1.0       1.3       2.0
    ≤4 buffers/vc           14000       1.2       1.8       3.0
Totals
  simplest                  58000      40.7      45.5      55.6
  most complex              62000      43.5      49.9      63.1

The  routing  controller  estimates  are  based  on  the  design  described  in 
[Fuji80],  modified  to  include  circuitry  for  a  256  entry  hierarchical  routing  table 
supporting  up  to  8  levels  of  lookup  tables  (see  figures  4.4a  and  4.4b).  The 
numbers  in  table  4.2  only  include  the  hardware  support  provided  by  the  routing 
controller  for  setting  the  translation  tables,  and  do  not  include  the  processor  or 
microcode  memory  portions  of  the  routing  controller.  These  are  listed 
separately  in  the  table  under  "routing  processor". 

The translation table requires p × c entries. Each entry specifies an output
port or the routing controller (3 bits) and a channel number (6, 7, or 8 bits for
64, 128, or 256 channels respectively), so 9, 10, or 11 bits are required. The
buffer memory estimate includes circuitry to interface each memory module to
the two buses, registers to hold the pipelined buffer addresses, and a 32 byte
RAM to hold data. The figure in table 4.2 includes circuitry for sixteen memory
modules.
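
The translation table sizes can be checked with a few lines of C. The sketch assumes that p is the number of ports (4 here) and c the number of channels per link, which is an interpretation rather than something stated in this section, and that c is a power of two; the function names are hypothetical.

```c
/* Number of bits needed to number c channels (c assumed a power of two). */
int channel_bits(int c)
{
    int bits = 0;

    while ((1 << bits) < c)
        bits++;
    return bits;
}

/* Total translation table size in bits for p ports and c channels per
 * link: p * c entries, each holding a 3-bit destination field plus a
 * channel number. */
long translation_table_bits(int p, int c)
{
    return (long)p * c * (3 + channel_bits(c));
}
```

With p = 4 this gives 2304, 5120 and 11264 bits for 64, 128 and 256 channels, which at 1024 bits per kbit corresponds to the translation table figures of roughly 2.2, 5.0 and 11.0 kbits in table 4.2.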

Two  estimates  of  the  buffer  management  circuitry  are  shown  in  table  4.2. 
The  first  assumes  each  virtual  circuit  can  only  use  at  most  one  buffer  at  a  time, 
and  is  based  on  the  design  shown  in  figure  4.8b.  The  second  assumes  four 
buffers  can  be  used,  and  is  based  on  the  design  in  figure  4.8a. 

Four  estimates  of  the  flow  control  logic  are  shown.  The  first  two  use  the 
send/acknowledge  protocol,  and  the  latter  two  use  a  remote  buffer  manage¬ 
ment  scheme.  In  addition,  each  of  these  designs  allows  virtual  circuits  to  use 
either  one,  or  up  to  four  buffers  at  a  time.  The  estimate  for  the 
send/acknowledge  protocol  with  up  to  four  buffers  per  channel  is  based  on  the 
design shown in figure 4.10a. As discussed in section 4.2.4.3, the remote buffer
management scheme uses the same circuitry, as well as the additional logic
shown in figure 4.10b. Modifications corresponding to the one buffer per channel
restriction are outlined in sections 4.2.4.2 and 4.2.4.3.

The figures in table 4.2 indicate that a minimal complexity communication
component with 4 I/O ports, 32 16-byte buffers, 64 channels per link, one buffer
per virtual circuit, and a send/acknowledge flow control scheme can be
constructed with approximately 58,000 transistors for logic, and 40.7 kbits of RAM.
Assuming single transistor ROM cells for microcode memory and single
transistor dynamic RAM cells, approximately 100,000 transistors are required. Except
for the number of channels, this design is in accordance with the design
recommendations derived throughout this chapter. A similar design using 256
channels per link and the more complex routing scheme requires 59,000 transistors
of logic, and 56.4 kbits of RAM. Assuming a static RAM implementation using 5
transistors per RAM cell, this more complex design requires on the order of
350,000 transistors, most of which are taken up by memory. Chips are currently
available using 450,000 transistors, so the communication components
described here can be implemented with current integrated circuit technology.
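
The two totals quoted above can be reproduced with the following C sketch. It assumes that 1 kbit means 1024 bits and that every memory bit, including microcode, is counted at the same cost per cell; the function name is hypothetical.

```c
/* Total transistor estimate: logic transistors plus memory cells.
 * kbits is the memory size in units of 1024 bits (an assumption), and
 * transistors_per_cell is 1 for single transistor cells or 5 for the
 * static RAM cell assumed in the text. */
long total_transistors(long logic, double kbits, int transistors_per_cell)
{
    return logic + (long)(kbits * 1024.0 * transistors_per_cell + 0.5);
}
```

Under these assumptions, total_transistors(58000, 40.7, 1) comes to a little under 100,000 and total_transistors(59000, 56.4, 5) to a little under 350,000, in line with the estimates given in the text.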

Figure  4.18  shows  a  possible  floorplan  for  the  64  channel  communication 
component  described  above.  Sizes  of  various  sections  of  the  chip  are  based  on 
the  approximate  transistor  counts  listed  in  table  4.2,  based  on  single  transistor 
RAM  and  ROM  cells.  Data  paths  correspond  to  those  shown  in  figures  4.6  and  4.7. 

Figure 4.18.  Floorplan for communication component.


CHAPTER  FIVE 
CONCLUSIONS 


VLSI  technology  can  provide  us  with  a  novel  set  of  building  blocks  for  the 
construction  of  high-performance  point-to-point  networks  for  closely  coupled 
multicomputer systems. "Plug-compatible" VLSI communication components
with  3  to  5  ports  make  particularly  attractive  building  blocks.  Their  modularity 
permits  the  incremental  growth  of  a  multicomputer  system  with  a  correspond¬ 
ing  growth  of  the  total  bandwidth  of  the  communication  domain.  To  be  useful 
for  the  construction  of  systems  with  hundreds  or  thousands  of  processors,  the 
complexity  of  these  components  must  be  above  a  certain  threshold.  The  func¬ 
tionality  of  MOS  VLSI  chips  now  exceeds  this  threshold. 

Technological  considerations  in  the  design  of  communication  components 
have  been  examined.  The  described  approach  based  on  dedicated  links  between 
individual  switching  nodes  is  well  matched  to  the  evolving  VLSI  MOS  technology. 
The  overall  performance  of  the  network  depends  critically  on  the  total  chip 
bandwidth  of  these  components,  which  is  determined  to  a  large  degree  by  pack¬ 
aging  technology.  Performance  is  also  influenced  by  the  buffering  and  forward¬ 
ing  policies  employed,  which  depend  themselves  on  the  amount  of  buffer  space 
and  the  complexity  of  the  control  logic  in  the  switching  components.  Analytic 
and  simulation  models  have  been  used  to  investigate  the  impact  of  these  con¬ 
siderations  on  overall  network  performance.  Based  on  these  studies,  a  number 
of  conclusions  can  be  drawn  regarding  the  design  of  VLSI  communication  com¬ 
ponents.  These  include: 

(1)  A small number of ports should be used, say from 3 to 5.


(2)  Under  current  technology,  the  degree  of  multiplexing  on  each  communica¬ 
tion  link  should  be  relatively  large.  Each  link  should  provide  a  large 
number  of  channels,  say  128  or  256. 

(3)  Only a relatively small number of buffers, say 16 or 32, need to be provided.
Further,  restricting  virtual  circuits  to  using  at  most  one  buffer  per  node  at 
one  time  causes  little  performance  degradation. 

(4)  Negatively acknowledged packets in the send/acknowledge flow control
mechanism  do  not  lead  to  a  significant  performance  degradation,  implying 
that  more  sophisticated  schemes,  such  as  the  sender-controlled  remote 
buffer  management  scheme,  are  not  justified. 

(5)  A  multicast  mechanism  has  a  significant  impact  on  performance  in  applica¬ 
tions  which  send  the  same  data  to  many  different  destination  processors. 

An  important  issue  that  has  not  been  addressed  by  this  thesis  concerns 
fault  tolerance.  If  a  communication  component  fails,  routing  tables  need  to  be 
updated, and broken message paths must be restored. The rerouting must be
done  in  a  manner  that  ensures  that  loops  are  not  introduced.  Much  of  the  work 
in  rerouting  strategies  in  computer  networks  is  directly  applicable  here  [Taji77, 
Merl79, Sega81]. One must also safeguard the network against messages
addressed  to  non-existent  or  unreachable  nodes.  These  issues  cannot  be 
ignored  in  any  communication  component  design. 

For  the  near  future,  the  limited  number  of  devices  that  can  be  fabricated 
economically  on  a  single  chip  will  encourage  the  development  of  separate 
switching  components.  However,  towards  the  end  of  this  decade,  the  preferred 
building  block  may  well  consist  of  a  powerful  processor,  a  substantial  amount  of 
on-chip  memory,  and  the  switching  circuitry  that  is  needed  so  that  these  com¬ 
ponents  can  be  readily  plugged  together  into  a  working  multicomputer  system. 


ABSTRACT


This  document  describes  a  simulator  for  modelling  execution 
of  parallel  programs  on  a  multiprocessor.  The  simulator  executes 
a  set  of  programs  as  if  each  were  run  on  a  separate  processor,  and 
compiles  statistics  for  the  entire  run.  A  portion  of  the  simulator, 
known  as  the  "switch  model",  simulates  the  exchange  of  messages 
among  the  processors  through  an  interconnection  network.  The 
simulator  was  developed  to  allow  different  types  of  interconnection 
hardware (e.g. packet switches, crossbars, etc.) to be modelled by
simply  "plugging  in"  the  appropriate  switch  model. 

Applications  are  programmed  as  a  set  of  communicating  tasks 
(processes).  The  interface  seen  by  the  applications  programmer  is 
discussed,  and  in  particular,  the  communications  mechanism  is 
described  in  detail.  Examples  are  given.  Finally,  the  implementa¬ 
tion  of  Simon  is  described. 

1.  INTRODUCTION


This  document  describes  a  simulation  program  called  Simon  (simulator  of 
multicomputer networks). It is assumed that the reader has at least a reading
knowledge of the C programming language [Kern79]. All applications programs
written for Simon should be written in C.

1.1.  What is Simon?

Simon  is  a  deterministic  event  driven  simulation  program  which  models  the 
execution  of  a  parallel  application  program  on  a  multiprocessor  computing  sys¬ 
tem.  The  application  program,  provided  by  the  user,  consists  of  a  number  of 
sequential programs which execute concurrently. Simon executes each
sequential program as if it were run on a separate processor. Message transmissions
between  the  processors  are  simulated,  and  statistics  concerning  the  program's 
dynamic  behavior  (e.g.  execution  time,  time  spent  waiting  for  data,  etc.)  are 
reported. 

Simon  consists  of  three  distinct  modules: 

(1)  The  application  program. 

(2)  The  simulator  base. 

(3)  The  switch  model. 

Each  of  these  will  now  be  described  in  turn. 

The  first  component,  the  application  program,  consists  of  a  number  of  user 
defined  "tasks"  written  in  C.  A  task  is  defined  as  a  sequential  program,  and  the 
data  it  uses,  executing  on  a  processor.  The  tasks  execute  asynchronously,  and 
communicate  by  exchanging  messages.  Intertask  communication  is  made  possi¬ 
ble  by  routines  provided  by  Simon  for  sending  and  receiving  messages.  No 
shared  memory  is  allowed,  i.e.  no  two  tasks  can  directly  access  the  same 
memory location. A task is conceptually identical to a process; however, this
term  is  reserved  to  refer  to  a  Simon  program  executing  on  a  host  computer. 

The  second  component,  the  simulator  base,  time  multiplexes  execution  of 
the tasks on the host computer, in this case a VAX. The base also collects
statistics on the application program and outputs them when execution completes.


The  third  component,  the  switch  model,  simulates  the  transmission  of  mes¬ 
sages  among  the  processors.  The  switch  model  might  be,  to  mention  a  few,  a 
crossbar,  an  Ethernet,  or  a  store-and-forward  communications  network.  By 
separating the model for the interconnection switch into a separate module,
alternate  switching  structures  can  be  compared  by  just  "plugging  in"  the 
appropriate  switch  models. 
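
One way such a "plug-in" arrangement could look in present-day C is sketched below. This is not Simon's actual interface, which is not given here; the switch_model structure, the delay function signature and the two toy models are all hypothetical and serve only to illustrate the idea.

```c
/* Hypothetical interface: every switch model supplies one function that
 * returns the simulated delivery delay of a message. */
typedef struct {
    const char *name;
    long (*delay)(int src, int dst, int nbytes);
} switch_model;

/* Toy crossbar: fixed latency plus one time unit per byte. */
static long crossbar_delay(int src, int dst, int nbytes)
{
    (void)src;
    (void)dst;
    return 10 + nbytes;
}

/* Toy store-and-forward line of nodes: the full cost is paid per hop. */
static long store_forward_delay(int src, int dst, int nbytes)
{
    int hops = src > dst ? src - dst : dst - src;

    return (long)hops * (10 + nbytes);
}

const switch_model crossbar      = { "crossbar", crossbar_delay };
const switch_model store_forward = { "store-and-forward", store_forward_delay };

/* The rest of the simulator only ever calls through the interface. */
long message_delay(const switch_model *m, int src, int dst, int nbytes)
{
    return m->delay(src, dst, nbytes);
}
```

Because the caller depends only on the structure, comparing two interconnection schemes means passing a different switch_model to the same simulation run.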

1.2.  Who Should Use Simon?

Simon  was  written  with  two  types  of  users  in  mind: 

(1)  Persons  interested  in  developing  and  analyzing  parallel  application  pro¬ 
grams. 

(2)  Persons  interested  in  analyzing  the  performance  of  various  interconnection 
schemes  under  loads  generated  by  real  application  programs. 

In  the  first  case,  applications  such  as  circuit  simulation  and  voice  recognition 
are envisioned, although Simon is by no means restricted to these applications.
In  the  second  case,  Simon  provides  an  alternative  to  simulation  models  based  on 
loads  created  artificially  by  random  number  generators. 

1.3.  Restrictions 

In  order  to  reduce  the  complexity  and  overhead  of  running  Simon,  several 
restrictions  have  been  made.  First,  this  implementation  assumes  that  each  task 
runs on its own processor. Thus, two distinct tasks may not execute on the same
processor. The maximum number of processors (tasks), however, is virtually
unlimited, subject only to memory constraints on the computer running Simon.
Further, it is assumed that tasks and communications among tasks are
statically defined, i.e. all tasks must be created before any can begin execution, and
each  task  must  initially  specify  all  other  tasks  with  which  it  may  send  or  receive 
messages. 


1.4.  About  this  Document 

This document is intended for those who are interested in writing parallel
application  programs  for  Simon,  or  for  those  interested  in  understanding  its 
internal  structure  and  implementation.  For  the  former,  mechanisms  for  creat¬ 
ing  tasks  are  defined,  as  well  as  intertask  communications  facilities.  For  the 
latter,  an  overview  of  the  current  implementation  of  Simon  for  a  VAX-11/780  will 
be  discussed. 

The  remainder  of  this  document  consists  of  six  sections,  and  a  number  of 
appendices.  The  next  two  sections  describe  what  the  user  needs  to  know  to 
write  application  programs.  Section  2  describes  the  routines  provided  for  creat¬ 
ing  and  executing  tasks,  and  section  3  describes  routines  for  exchanging  mes¬ 
sages.  Section  4  gives  some  examples  using  the  routines  defined  in  the  previous 
two  sections.  Section  5  describes  the  timing  model  which  is  used  for  gathering 
statistics  on  the  task.  Section  6  gives  some  guidelines  for  writing  programs  for 
Simon.  These  guidelines  are  provided  to  avoid  some  rather  subtle  bugs  in  appli¬ 
cations  programs  which  Simon  cannot  detect.  Finally,  section  7  describes  the 
internal  organization  of  the  Simon  program. 

A  summary  of  all  of  the  routines  discussed  in  this  document  is  given  in 
Appendix  I.  Other  appendices  describe  the  mechanics  of  using  Simon  and  the 
special  C  compiler  necessary  for  the  timing  model. 


2.  TASKS


An  application  program  can  view  SIMON  as  a  set  of  subroutines  for  perform¬ 
ing  various  functions,  much  like  a  C  program  can  view  an  operating  system  as  a 
set of subroutines for, e.g., printing results or reading data from a file. This
section  and  the  next  describe  the  functions  provided  by  SIMON  as  well  as  the 
routines  which  implement  them. 

The  application  program  executing  on  the  multiprocessor  is  a  set  of  con¬ 
currently  executing  "tasks"  (or  equivalently,  processes).  A  task  can  be  thought 
of as an instance (i.e. a loaded and executing core image) of a C program. A
program,  like  that  in  a  conventional  uniprocessor,  consists  of  a  set  of  routines 
which  call  each  other  in  some  arbitrary  fashion.  One  routine  in  the  program  for 
a task acts as the "main procedure", and is called when the task first begins
execution. This routine should not be called "main".

Several  tasks  may  be  created  from  a  single  program.  To  the  user,  this  is 
equivalent  to  generating  multiple  copies  of  the  program,  and  creating  a  single 
task  for  each  one.  In  reality  however,  SIMON  shares  a  single  copy  of  the  code 
among  the  tasks,  and  creates  a  separate  data  area  for  each. 

This section describes the routines provided by SIMON for creating and
executing tasks. First, task creation via the "mktask()" routine is described. A
task cannot be created, however, unless there already exists another task to
create it, presenting a "chicken and egg" problem. To solve this problem, a
routine called "bootstrap()" is provided for getting the simulator started. This is
discussed in the following subsection. If multiple tasks are created from a single
program and that program uses static or global variables, a routine called
"init()" must be used. This is the subject of the third subsection.

2.1.  Creating Tasks: The Mktask() Routine

SIMON provides a routine called mktask() for creating tasks. Mktask()
performs several functions. In addition to "creating" the task, mktask() assigns the
task a user provided name and task "id". Every task has a unique id which is
used internally to distinguish it from other tasks (note that task names need not
be unique). This id may also be used in conjunction with some switch models to
assign certain tasks to execute on specific processors. Space for maintaining
information necessary for the task to execute (e.g. state information) is also
allocated, as well as space for collecting statistics. Mktask() is also used to
identify the task's main procedure. When the task first begins execution, this
procedure is called via an ordinary subroutine call. Finally, mktask() allows the
user to specify parameters which are passed to the created task when it begins
execution.

Mktask() takes five parameters, and is defined as follows:

    mktask (name, cdptr, id, parm, lngth)

where:

name = a character string specifying the name of the task.

cdptr = a pointer to the main procedure for the task.

id = a positive integer specifying the id to be assigned to the task.

parm = a pointer to a parameter list.

lngth = the length of this parameter list in bytes.

It  is  an  error  to  assign  two  tasks  to  the  same  id,  or  to  assign  a  task  to  id  0. 
It  will  be  seen  later  that  task  id  0  is  reserved  for  the  "bootstrap"  task. 

The  last  two  parameters  are  used  to  copy  the  parameter  list  into  an  inter¬ 
nal  buffer.  A  pointer  to  this  buffer  is  passed  to  the  task  when  it  is  first  called.  If 
the  size  of  the  parameter  list  is  specified  as  zero,  mktask()  assumes  no  parame¬ 
ters  are  to  be  passed  to  the  task.  The  parameter  pointer  must  be  NULL  in  this 
case,  or  an  error  will  result. 
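
The rules for the last two parameters can be modelled by the following C sketch. It is not the code used inside SIMON; the helper copy_parms and its error flag are hypothetical, and only the behavior stated above (a private copy of lngth bytes, and a NULL pointer required when the length is zero) is taken from the text.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a private copy of the parameter list, or
 * NULL if no parameters are passed. Sets *err to 1 on a bad call. */
void *copy_parms(const void *parm, int lngth, int *err)
{
    void *copy;

    *err = 0;
    if (lngth == 0) {
        if (parm != NULL)
            *err = 1;            /* zero length requires a NULL pointer */
        return NULL;
    }
    copy = malloc((size_t)lngth);
    if (copy == NULL) {
        *err = 1;                /* out of memory */
        return NULL;
    }
    memcpy(copy, parm, (size_t)lngth);   /* bytes copied, not interpreted */
    return copy;
}
```

Since the task later receives a pointer to the copy, changes made to the caller's variable after the call are not seen by the task.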

Thus,  the  main  procedure  of  each  task  must  have  at  most  one  parameter,  a 
pointer  to  parameter(s)  used  by  that  task.  If  a  task  requires  more  than  one 
parameter,  a  structure  should  be  created,  and  a  pointer  to  this  structure 
passed  to  mktask(). 

The  parameter  list  passed  to  a  task  should  not  contain  pointers,  i.  e.  all 
parameters  passed  to  the  task  should  physically  reside  in  the  block  of  memory 
specified by the last two parameters passed to mktask(). Mktask() copies the
specified data without interpretation, so any data stored at the address
specified  by  the  pointer  may  have  been  changed  by  the  time  the  task  begins  to 
execute.  Passing  pointers  in  the  parameter  list  can  lead  to  rather  obscure  bugs. 
An  example  of  this  will  be  seen  later. 

To illustrate the use of mktask(), a few examples will now be given. The
simplest form of mktask() is shown below:

    bootstrap()
    {
        int foo();

        mktask ("foo", foo, 5, NULL, 0);
    }

    foo()
    {
        /* code for task foo */
    }

This  creates  a  task  called  "foo"  which  is  assigned  to  task  id  5.  The  main  pro¬ 
cedure  of  the  task  is  also  called  foo.  Foo  is  not  passed  any  parameters. 
Bootstrap()  is  a  task  already  in  existence,  and  will  be  discussed  in  the  next  sub¬ 
section. 

A second example demonstrating the passing of a single parameter to a
task is given below:

    bootstrap()
    {
        int fie();
        int i;

        i = 20;
        mktask ("fie", fie, 6, &i, sizeof (int));
    }

    fie(p)
    int *p;
    {
        printf ("Parameter passed is %d\n", *p);
    }

Here, task fie is created and assigned to task id 6. When executed, the line
"Parameter passed is 20" will be displayed on the terminal screen. Note the use
of the sizeof operator to get the size, in bytes, of a C object.


Finally, a third example demonstrates the passing of several parameters to
two different tasks via structures.

    struct parm {           /* format of parameters */
        int a[10];          /* Pass an array */
        int n;              /* number of elements */
    };

    bootstrap ()
    {
        int T();
        struct parm p;
        int i, id;

        id = 1;             /* id assigned to tasks */
        p.n = 10;
        for (i=0; i<p.n; i++) p.a[i] = i;
        mktask ("T", T, id++, &p, sizeof (struct parm));
        for (i=0; i<p.n; i++) p.a[p.n-1-i] = i;
        mktask ("T", T, id++, &p, sizeof (struct parm));
    }

    T (p)
    struct parm *p;
    {
        int i;

        for (i=0; i<p->n; i++)
            printf ("%d\n", p->a[i]);
    }

Two  tasks  are  created  from  the  program  T.  The  first  is  passed  an  array  of 
integers  in  ascending  order,  and  the  second  is  passed  an  array  in  descending 
order.  Note  that  the  structure  parm  contains  no  pointers.  Suppose  we  decided 
to pass the tasks a pointer to the data rather than the data itself. Bootstrap()
might then be defined as follows:

    struct parm {           /* format of parameters */
        int *a;             /* Pass a pointer!! */
        int n;              /* number of elements */
    };

    bootstrap()
    {
        int T();
        struct parm p;
        int i, id, buf[10];

        id = 1;
        p.n = 10;
        p.a = buf;
        for (i=0; i<p.n; i++) p.a[i] = i;
        mktask ("T", T, id++, &p, sizeof (struct parm));
        for (i=0; i<p.n; i++) p.a[p.n-1-i] = i;
        mktask ("T", T, id++, &p, sizeof (struct parm));
    }

There are two reasons why this program malfunctions. First, buf is declared
as an automatic (i.e. non-static local) variable, and is thus kept on the
runtime stack for bootstrap(). The pointer passed to the tasks is thus a
pointer into the runtime stack! When bootstrap() returns, its stack is
returned to the system, and subsequently overwritten by other procedures.
Thus, the created tasks receive garbled data. However, even if buf were
declared as a global variable (and thus not kept on the stack), there is
still another problem. The first task is passed only a pointer to the data,
and not the data itself. The second "for loop" in bootstrap() modifies this
data, and so both tasks receive the array in descending order. In the
previous (correct) example, the actual data was passed via mktask() to the
newly created task. Since mktask() creates a separate copy of this data, it
cannot be changed by any user program. For this reason, it is recommended
that structures passed to tasks as parameters contain no pointer fields.
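
The difference between passing the data and passing a pointer can be
reproduced outside the simulator. The following is a minimal sketch in ANSI
C, not SIMON code: copy_params() is a hypothetical stand-in for the copy
that mktask() is described as making, and the structure and function names
are invented for illustration.

```c
#include <assert.h>
#include <string.h>

#define LEN 4

/* Parameter block with the data embedded (no pointers). */
struct parm_by_value { int a[LEN]; int n; };

/* Parameter block that only points at the data. */
struct parm_by_pointer { int *a; int n; };

/* Hypothetical stand-in for the copy made of a parameter block:
   a plain byte-for-byte copy of `size` bytes. */
void copy_params(void *dst, const void *src, size_t size)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, size);
}

/* Reports the first element each copied block refers to after the
   creator overwrites its data: out[0] for the embedded array,
   out[1] for the pointer version. */
void copy_demo(int out[2])
{
    static int buf[LEN];              /* static, so it outlives the call */
    struct parm_by_value v, v_copy;
    struct parm_by_pointer p, p_copy;
    int i;

    v.n = p.n = LEN;
    p.a = buf;
    for (i = 0; i < LEN; i++) { v.a[i] = i; buf[i] = i; }

    copy_params(&v_copy, &v, sizeof v);   /* copies the array itself */
    copy_params(&p_copy, &p, sizeof p);   /* copies only the address */

    /* The creator now reverses its data for the next task. */
    for (i = 0; i < LEN; i++) {
        v.a[LEN - 1 - i] = i;
        buf[LEN - 1 - i] = i;
    }

    out[0] = v_copy.a[0];   /* private copy: unchanged */
    out[1] = p_copy.a[0];   /* shared storage: sees the reversal */
}
```

After copy_demo() returns, out[0] still holds the value stored before the
second loop, while out[1] reflects the later change. This mirrors the
aliasing problem described above.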

2.2. Getting Started: The Bootstrap() Routine

In order for the user to create and execute tasks, he must execute
mktask(). In order to execute mktask(), he must have created and executed a
task. To end this cycle, SIMON assumes the preexistence of a user-defined
task called bootstrap(). Bootstrap() is passed no parameters, and is
arbitrarily assigned task id 0.

Bootstrap() uses mktask() to create the user tasks, and then returns
control back to the simulator, never to be executed again. Its sole purpose
is to create the set of tasks the user wishes to execute. Bootstrap() is not
allowed to communicate with any other task. All tasks, including
bootstrap(), are created and begin execution at time 0, regardless of when
mktask() was executed. Thus, in this respect, bootstrap() is a somewhat
artificial task used only to get the simulation started. Within SIMON,
however, bootstrap() is treated as any ordinary task.

The first implementation of SIMON assumes that tasks cannot be created
dynamically. All tasks must be created by bootstrap(). Thus, it is an error
for any task other than bootstrap() to execute the mktask() routine.

2.3. Multiple Instances of Tasks: The Init() Routine

In  sequential  programming,  one  often  writes  a  subroutine  to  perform  some 
function  independently  of  the  specific  data  it  is  to  operate  on.  This  function  is 
then  called  from  various  points  in  the  program,  each  time  specifying  the  data 
via  parameters.  Thus,  the  algorithm  remains  the  same,  but  the  data  varies  from 
call to call. Each call to the subroutine creates an "instance" of that subroutine.
Local  variables  are  created,  actual  parameters  (those  specified  in  the  function 
call)  are  mapped  to  formal  parameters  (those  specified  in  the  subroutine),  etc. 
Since  execution  is  sequential,  at  most  one  instance  of  the  subroutine  exists  at 
any  given  time,  ignoring  recursion. 

Similarly, it is useful to write one program and then create many tasks
from this program which differ only in the data being operated on. Here,
however, the various tasks execute concurrently, and thus exist
simultaneously. Just as one parameterizes instances of a subroutine, one
must also parameterize tasks (as described above for the mktask() routine).
This is useful, for example, in array operations where each task performs
the same computation, but on different parts of the array.

When multiple instances of a task are created from a program using
non-automatic (i.e. static or global) variables, the simulator defined
routine init() must be used. Init() tells SIMON where these variables
reside. SIMON must have this information so that these variables can be
saved and restored when execution switches among tasks created from the
same program. Automatic variables are stored on the runtime stack of the
executing program, and are saved and restored by SIMON when necessary,
transparent to the user.
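
One way to picture the save and restore step is sketched below. This is an
assumption about how such a mechanism could work, not a description of
SIMON's internals: task_ctx, register_area() and switch_task() are
hypothetical names, with register_area() playing the part of init().

```c
#include <assert.h>
#include <string.h>

#define MAX_TASKS 4
#define MAX_AREA  64

/* Hypothetical per-task record kept by a simulator. */
struct task_ctx {
    void  *glob;                    /* start of the global data area */
    size_t length;                  /* its length in bytes */
    unsigned char saved[MAX_AREA];  /* this task's private image */
};

struct task_ctx tasks[MAX_TASKS];

/* Record where a task's non-automatic variables live. */
void register_area(int id, void *glob, size_t length)
{
    assert(id >= 0 && id < MAX_TASKS && length <= MAX_AREA);
    tasks[id].glob = glob;
    tasks[id].length = length;
    memcpy(tasks[id].saved, glob, length);      /* initial image */
}

/* Switch from task `from` to task `to`: save one image, restore the other. */
void switch_task(int from, int to)
{
    memcpy(tasks[from].saved, tasks[from].glob, tasks[from].length);
    memcpy(tasks[to].glob, tasks[to].saved, tasks[to].length);
}

/* Two instances of one program share this single static area. */
struct gstruct { int counter; } glob_area;

int save_restore_demo(void)
{
    glob_area.counter = 0;
    register_area(0, &glob_area, sizeof glob_area);
    register_area(1, &glob_area, sizeof glob_area);

    glob_area.counter = 10;     /* task 0 runs and stores a value */
    switch_task(0, 1);          /* task 1 gets its own image back */
    glob_area.counter += 1;     /* task 1 changes only its image */
    switch_task(1, 0);          /* back to task 0 */
    return glob_area.counter;   /* task 0's value is intact */
}
```

Each instance sees the same variable names, but the values it stored
survive while other instances run, because the whole area is swapped as one
block.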

Non-automatic variables should all be stored in a single block of memory
which will be referred to here as the "global data area". Init() has two
parameters, and is defined as follows:

init  (glob,  lngth); 

where: 

glob  =  a  pointer  to  the  start  of  the  global  data  area, 
lngth  =  the  length  of  this  area  in  bytes. 

Init() need only be used in a task if multiple instances are created. It
must be executed exactly once by such tasks, and must execute before any
other simulator defined routine is executed.

To use init(), the user should:

(1)  place  ALL  of  the  non-automatic  variables  for  the  task  into  a  structure. 

(2)  call init() using a pointer to this structure, and a size obtained from sizeof.

In addition, it is recommended that arrays be created dynamically via one
of the storage allocation routines (e.g. calloc()) provided in C. This will
reduce the execution time of the simulator, since variables created by
(say) calloc() need not be saved and restored when execution switches from
one task to another. This is because each task (even those created from the
same program) dynamically creates its own data area to which it has sole
access. In other instances, this recommendation becomes a requirement,
since each task is allocated a fixed size area for saving its stack. If an
attempt is made to save more data than this area can hold, an execution
error results. Data areas created by calloc() are not kept on the stack.

In the discussion above, it is assumed that non-automatic variables are
used to share data between procedures of the same task. Different tasks
must not use global variables to exchange information, as this compromises
the simulation by performing communications between tasks without using the
switch model.


The following is an example of a program from which several tasks are
created. Note the use of calloc() for creating arrays, and the manner in
which non-automatic variables are declared so their location can be passed
to SIMON via init().

static struct gstruct   /* Global data area */
{
    int a, b, c;        /* Some scalars */
    int *g_arr;         /* A global array */
} glob;

foo ()                  /* Program for a task */
{
    int i, j;           /* Automatic scalars */
    int *a_arr;         /* Automatic array */
    char *calloc();     /* Storage allocator */

    /* Define global data area */
    init (&glob, sizeof (struct gstruct));

    /* Create automatic variable array */
    a_arr = (int *) calloc (100, sizeof (int));

    /* Create global array */
    glob.g_arr = (int *) calloc (20, sizeof (int));

    /* Reference globals and automatic locals */
    i = 10; j = 30;     /* Automatic scalars */
    a_arr[j] = 30;      /* Automatic array */
    glob.a = 20;        /* Global scalar */
    glob.g_arr[i] = 5;  /* Global array */
}

2.4. Other Routines

Finally, two other routines pertaining to tasks will be mentioned. These
are task(), which returns a task's id, and taskname(), which returns a
pointer to the task's name (the character string passed to mktask() when
the task was created). The character string returned must not be
overwritten, as it is SIMON's only copy.

The task() routine takes no parameters, and returns an integer value.
Taskname() takes a single parameter, the id of the task in question, and
returns a pointer to a character string. Thus, a task can determine its own
name by calling:

taskname  (task()); 

The  information  passed  through  a  task’s  name  or  its  id  may  be  used  for  further 
parameterization  of  the  task. 
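
As an illustration of parameterizing a task by its name, the sketch below
lets one routine take on one of two roles. It is a mock written for this
purpose only: mock_task() and mock_taskname() imitate the behavior described
for task() and taskname(), and the task names "even" and "odd" are invented.

```c
#include <assert.h>
#include <string.h>

#define NTASKS 3

/* Hypothetical task table; in SIMON the names come from mktask(). */
const char *task_names[NTASKS] = { "bootstrap", "even", "odd" };
int current_id = 0;             /* id of the task that is "running" */

/* Mock of task(): id of the calling task. */
int mock_task(void) { return current_id; }

/* Mock of taskname(): name recorded for task `id`. */
const char *mock_taskname(int id)
{
    assert(id >= 0 && id < NTASKS);
    return task_names[id];
}

/* One program serving two roles, selected by the task's own name:
   sums the even-indexed or the odd-indexed elements of v. */
int sum_by_role(const int *v, int n)
{
    int start = strcmp(mock_taskname(mock_task()), "odd") == 0 ? 1 : 0;
    int i, sum = 0;

    for (i = start; i < n; i += 2)
        sum += v[i];
    return sum;
}
```

The routine never receives its role as an argument; it finds it out by
asking for its own name, so no extra parameter block is needed.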

3.  INTERTASK  COMMUNICATIONS

This  section  describes  the  mechanisms  for  allowing  tasks  to  send/receive 
messages  to/from  other  tasks.  Here,  two  types  of  communications  mechanisms 
are  defined: 

(1)  specification  of  which  tasks  communicate  with  which  other  tasks. 

(2)  sending  and  receiving  messages. 

The first mechanism uses "fifo's" whose names are globally known throughout
the multiprocessor. Each task has a set of "export" and a set of "import"
fifo's for sending and receiving messages respectively. A task sends a
message by putting it into one of its globally known export fifo's. A
message is received by taking it out of an import fifo. The details of how
the data is transported from an export to an import fifo are left up to the
switch model, and will not be discussed here.

The next subsection discusses the basic communications mechanism in greater
detail. Following this, the subroutines implementing each of the two types
of communications mechanisms defined above will be described. A concise
definition of the routines defined in this section is given in Appendix I.

3.1.  Overview  of  the  Communications  Mechanisms 

Data  may  be  transmitted  to  or  received  from  other  tasks  via  fifo’s.  Each 
task  creates  a  set  of  fifo’s  which  act  as  the  interface  between  it  and  any  tasks  it 
communicates  with. 

Fifo's hold messages which are being sent to or have arrived from other
tasks. Elements (i.e. messages) of the same fifo may vary in size. The
contents of messages are not interpreted by the simulator.

Each fifo has a user defined name (an arbitrary character string) which is
globally known by all other tasks. Furthermore, fifo's are classified as
being either "export" or "import". When a task creates an export fifo of
some name (say X) it is given the capability to "export" data (i.e. send
messages) to all tasks which have an import fifo called X.


A task may have both an export and an import fifo called X, but it cannot
have more than one import or export fifo of the same name. Several tasks
may each have an import fifo called X, allowing one to easily broadcast
data to several tasks. Similarly, several tasks may have export fifo's of
the same name, allowing several tasks to send data to one (or more) import
fifo(s). Thus, the only restriction on creating fifo's is that a single
task may not create more than one import fifo or more than one export fifo
of the same name.

Thus, communication paths among tasks are defined by the names of the
fifo's created within each task. When a task wants to send a message to
another task, it puts the data into one of its export fifo's. The switch
model removes the data from the export fifo, replicates it as many times as
necessary, and places a copy in each import fifo of the same name as the
export fifo the data was originally put in. All of this is transparent to
the programmer. The receiver now only needs to get the data out of its
import fifo. Two tasks must have a commonly named fifo to exchange data
directly.
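
The replication step can be sketched as follows. This is a simplified model
with invented names (import_fifo, add_import(), deliver()) and integer
messages; it ignores timing and the details of any particular switch model.
The sender's own import fifo is skipped, matching the behavior this report
gives for put(), and delivery to a full fifo is simply skipped here as a
simplification.

```c
#include <assert.h>
#include <string.h>

#define MAX_FIFOS 8
#define MAX_MSGS  4

/* Hypothetical import fifo holding small integer messages. */
struct import_fifo {
    const char *name;       /* globally known fifo name */
    int owner;              /* id of the task that created it */
    int msgs[MAX_MSGS];
    int count;
};

struct import_fifo imports[MAX_FIFOS];
int n_imports = 0;

/* Create an import fifo called `name` for task `owner`. */
void add_import(const char *name, int owner)
{
    assert(n_imports < MAX_FIFOS);
    imports[n_imports].name = name;
    imports[n_imports].owner = owner;
    imports[n_imports].count = 0;
    n_imports++;
}

/* Deliver one message sent by task `sender` on export fifo `name`:
   every other task's import fifo of that name gets its own copy.
   Returns the number of copies made. */
int deliver(const char *name, int sender, int msg)
{
    int i, copies = 0;

    for (i = 0; i < n_imports; i++) {
        struct import_fifo *f = &imports[i];
        if (f->owner != sender && strcmp(f->name, name) == 0
                && f->count < MAX_MSGS) {
            f->msgs[f->count++] = msg;  /* appended in arrival order */
            copies++;
        }
    }
    return copies;
}
```

Because each copy is appended at the back of its fifo, two messages sent
one after the other on the same name are seen in that order by every
receiver.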

Sequentiality is preserved between any pair of export and import fifo's of
the same name. If two messages are placed in export fifo "X", one after the
other, the simulator will guarantee that these messages arrive in all
import fifo's named "X" in the same order in which they were sent. In
network terminology, every pair of identically named export and import
fifo's forms a "virtual circuit".

The next subsection describes creation of export and import fifo's. The
following subsection describes the routines provided for putting (getting)
data into export (from import) fifo's.

3.2.  Fifo  Creation 

This subsection describes the simulator defined routines for creating
fifo's. Export() and import() are the basic routines used for creating
export and import fifo's respectively. Also, arrays of fifo's can be created
using the exparr() and imparr() routines. Two additional routines called
cnvarr() and cnvarrsv() are provided to "index" into an array of fifo's.


When  a  task  begins  executing  for  the  first  time,  it  must  create  all  of  the 
fifo's  it  will  need  for  the  entire  simulation  run.  No  dynamic  fifo  creation  is 
allowed.  After  this  initial  execution  period  (which  ends  when  the  task  gives  up 
the  cpu  to  allow  other  tasks  to  execute)  the  task  must  not  attempt  to  create  any 
new  fifo’s.  Any  attempt  to  do  so  will  be  flagged  as  an  error. 

3.2.1. Export()

The export() function creates a single export fifo. It takes a single
parameter, a character string (or more accurately, a pointer to a character
string) specifying the name of the fifo. When export() is called, the
simulator creates an export fifo of this name, allowing the task to send
data to all other tasks which have an import fifo of the same name. Thus,
the statement:

export  ("X"); 

creates  an  export  fifo  called  X. 

3.2.2. Import()

Import() is very similar to export(). Import(), however, takes two
parameters. In addition to the name parameter, it requires a parameter
specifying the maximum number of messages the import fifo can hold. If a 0
is specified, the length of the fifo is not bounded (this is the most
commonly used case). Fifo's with a maximum length specified are called
bounded fifo's, and are useful for applications where a task receiving data
is not interested in all the data sent to it, but only the last (say) 3
values.

When import() is called, the simulator creates an import fifo of the
specified name. This allows the task to receive data sent by other tasks
with export fifo's of the same name. Data arriving into a bounded fifo which
is full causes the oldest data (at the front of the fifo) to be discarded,
the new data to be placed at the end of the fifo, and a flag (readable by
overfl(), defined below) to be set. Note that only import fifo's have
bounded length. For example,

import ("Y", 0);
import ("Z", 3);

creates two import fifo's. The fifo called Y is of unlimited length, but the
fifo called Z will only hold up to 3 messages.
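
The discard-oldest behavior of a bounded fifo can be sketched with a small
ring buffer. The code below is illustrative only: bounded_fifo and the bf_
routines are hypothetical names, messages are reduced to single integers,
and bf_overflowed() imitates the read-and-reset behavior this report gives
for overfl().

```c
#include <assert.h>

#define CAP_MAX 16

/* Hypothetical bounded fifo of integer messages. */
struct bounded_fifo {
    int data[CAP_MAX];
    int head;           /* index of the oldest message */
    int count;          /* messages currently held */
    int limit;          /* maximum number of messages (1..CAP_MAX) */
    int overflow;       /* set when an old message was discarded */
};

void bf_init(struct bounded_fifo *f, int limit)
{
    assert(limit >= 1 && limit <= CAP_MAX);
    f->head = f->count = f->overflow = 0;
    f->limit = limit;
}

/* A message arrives: if the fifo is full, drop the oldest one first. */
void bf_arrive(struct bounded_fifo *f, int msg)
{
    if (f->count == f->limit) {
        f->head = (f->head + 1) % f->limit;     /* discard front */
        f->count--;
        f->overflow = 1;
    }
    f->data[(f->head + f->count) % f->limit] = msg;
    f->count++;
}

/* Remove and return the oldest message; the fifo must not be empty. */
int bf_take(struct bounded_fifo *f)
{
    int msg;

    assert(f->count > 0);
    msg = f->data[f->head];
    f->head = (f->head + 1) % f->limit;
    f->count--;
    return msg;
}

/* Report the overflow flag and clear it. */
int bf_overflowed(struct bounded_fifo *f)
{
    int was = f->overflow;

    f->overflow = 0;
    return was;
}
```

With a limit of 3, sending the values 1 through 5 leaves 3, 4 and 5 in the
fifo and sets the flag, which is the "last 3 values" use mentioned above.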

3.2.3. Exparr(), Imparr(), Cnvarr(), and Cnvarrsv()

These routines are used for creating and accessing arrays of fifo's. An
"array of fifo's" is simply a set of fifo's, all of which are either export
or import fifo's, which follow a certain naming convention. The convention
for naming individual fifo's in an array is to concatenate a "base name"
with a subscript enclosed by brackets ([]). Thus the name of a fifo which is
part of an array looks like a C array variable. Subscripts start at 0.

Exparr() creates an array of export fifo's, and imparr() an array of import
fifo's. Exparr() takes two parameters, one specifying the base name to be
used in naming the fifo's, and the other specifying the length of the array.
Imparr() takes these two parameters plus a third giving the maximum length
of the fifo's, or 0 if their length is not limited. All import fifo's of an
array must have the same maximum length. Exparr() and imparr() use export()
and import() respectively to create fifo's. They could easily be written by
the user, but are provided as a convenience and to form a convention for
naming individual fifo's in a fifo array.

For  example, 

exparr  ("X",  4); 

creates four export fifo's (with base name X) called "X[0]", "X[1]", "X[2]",
and "X[3]". Note that no blanks are automatically inserted into the fifo
name. Remember that fifo's which are members of arrays are identical to
ordinary fifo's. The only difference is that they follow this naming
convention. A task could create a single fifo corresponding to one element
of the array by simply calling export() (or import()), as for example:

export  ("X[2]"); 

In practice, one would like to access the ith fifo, where i is an integer
variable. To do this, a mapping function called cnvarr() is provided.
Cnvarr() takes two arguments, a character string specifying a base name, and
an integer variable specifying a subscript. Thus,


cnvarr  ("X",  2); 

returns the character string "X[2]". No checking is done to see if the
resulting name corresponds to any fifo existing in the system. An error will
result, however, if an attempt is made to use an invalid fifo name.

The value returned by cnvarr() should never be stored into another
variable. It should only be used as the argument passed to a simulator
defined subroutine (e.g. put() or get(), defined below). The character
string returned is only valid until the next time the task executes
cnvarr(). This is because each task uses a single buffer for holding
character strings generated by cnvarr(). Cnvarr() always returns a pointer
to the start of this buffer. Thus, if two successive calls to cnvarr() are
made, the result of the second will overwrite the first.

If it is necessary to assign the pointer returned by cnvarr() to another
variable, the cnvarrsv() routine should be used. This is identical to
cnvarr(), except the storage for the resulting string is allocated by
calloc(), and thus remains indefinitely. Calloc() is not used on calls to
cnvarr() to conserve storage.
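
The difference between the two routines comes down to where the result
string lives. The sketch below reproduces that difference with two
stand-in functions, name_shared() and name_saved(); these names and the
64-byte buffer size are assumptions made for the example, and the real
routines may differ in detail.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NAME_MAX_LEN 64

/* Illustrative stand-in for cnvarr(): formats "base[index]" into one
   shared buffer, so each call overwrites the previous result. */
char *name_shared(const char *base, int index)
{
    static char buf[NAME_MAX_LEN];

    snprintf(buf, sizeof buf, "%s[%d]", base, index);
    return buf;
}

/* Illustrative stand-in for cnvarrsv(): same string, but in freshly
   allocated storage that stays valid until the caller frees it. */
char *name_saved(const char *base, int index)
{
    const char *tmp = name_shared(base, index);
    char *copy = calloc(strlen(tmp) + 1, 1);

    assert(copy != NULL);
    strcpy(copy, tmp);
    return copy;
}
```

A pointer kept from name_shared("X", 2) reads "X[3]" after a later call
with subscript 3, since both calls return the same buffer. Pointers
returned by name_saved() remain distinct and unchanged.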

3.3.  Sending  and  Receiving  Messages 

This subsection describes routines provided by the simulator for sending
and receiving messages. In addition, routines for interrogating the status
of fifo's are also discussed.

Three functions, put(), puts(), and get(), are available for putting data
into export fifo's and getting data from import fifo's. A routine called
waitf() allows a task to wait for data to arrive in any of several import
fifo's. Routines called size() and qlength() are available for determining
the size of the next message and the number of messages in an import fifo.
Finally, a routine called overfl() is used to detect overflow of import
fifo's which have a limited length (bounded fifo's). The following is a
detailed description of each of these routines.

3.3.1. Put() and Puts()

These two routines are used to send messages, i.e. put data into an export
fifo. The put() routine takes three parameters. The first is a character
string specifying the name of the export fifo data is to be put into. The
second is a pointer to the first byte of data to be sent. The third is the
number of bytes to be sent. When called, a contiguous block of data L bytes
long (where L is the third parameter) beginning at the memory location
specified by the second parameter is copied into one of the simulator's
internal buffers (forming a single message), and added to the export fifo
specified by the first parameter. Thus each element of the fifo is a block
of data (i.e. a message), which for a given fifo may vary in length from
element to element. Thus, the statement

put ("X", &i, sizeof(i));

sends  a  message  consisting  of  the  integer  (assume  i  is  declared  as  an  integer) 
which  is  the  current  value  of  the  variable  i. 

If a task declares an export and import fifo of the same name, it is
usually not desired that a put() cause the task to send data to itself.
Thus, put() only causes data to be sent to other tasks. In some situations,
however, it may be convenient to allow a task to receive its own messages.
For example, suppose a task needs to process data which is sometimes
created locally within that task, but other times remotely in some other
task. Allowing a task to send data to itself allows the receiver to treat
the two cases as one. For this purpose, the function puts() (put to self)
is defined. Puts() uses the same parameters as put(). The first parameter
should be the name of a fifo which is both an export and an import for that
task. When a task executes puts(), it sends data to itself as well as to
all other import fifo's of the same name. If a task executes puts() on an
export fifo for which it does not also have an import fifo of the same
name, an error will result.

3.3.2. Get()

Get() takes two parameters, a character string specifying the name of some
import fifo created by the task, and a pointer to where the received data is
to be stored. When called, the next message of the specified import fifo (in
the general case, a block of data of arbitrary length) is removed from the
fifo and copied into contiguous memory locations starting at the address
specified by the second parameter. If the specified import fifo is empty
when get() is called, the task waits for data to arrive. The waiting is, of
course, transparent to the programmer. Thus, the statement

get  ("X",  &j); 

might be used to receive the value of i sent by the task in the put()
example above. When the get() completes execution, j will hold whatever data
was sent. Note that the above implies that the receiver knows what type of
data is being sent. It is up to the user to ensure that the receiver
correctly interprets any data sent to it. If the next message in fifo X held
(say) a floating point number, the simulator would not complain, but the
result would, in general, be unpredictable if the receiver were expecting
an integer.
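
The copy-in and copy-out behavior described for put() and get() can be
sketched with a linked list of byte blocks. This is a stand-alone model and
not the simulator's code: msg_fifo, fifo_put(), fifo_get() and fifo_size()
are hypothetical names, there is no blocking, and the export and import
sides are collapsed into a single queue.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One queued message: a private copy of the sender's bytes. */
struct msg {
    struct msg *next;
    size_t len;
    unsigned char *data;
};

/* Hypothetical fifo: a singly linked list of messages. */
struct msg_fifo { struct msg *front, *back; };

/* Copy `len` bytes starting at `src` into a new message at the back. */
void fifo_put(struct msg_fifo *f, const void *src, size_t len)
{
    struct msg *m = malloc(sizeof *m);

    assert(m != NULL);
    m->data = malloc(len ? len : 1);
    assert(m->data != NULL);
    memcpy(m->data, src, len);
    m->len = len;
    m->next = NULL;
    if (f->back) f->back->next = m; else f->front = m;
    f->back = m;
}

/* Size in bytes of the next message, or 0 if the fifo is empty. */
size_t fifo_size(const struct msg_fifo *f)
{
    return f->front ? f->front->len : 0;
}

/* Remove the front message and copy it to `dst`, which must have room
   for fifo_size() bytes. Returns the number of bytes copied. */
size_t fifo_get(struct msg_fifo *f, void *dst)
{
    struct msg *m = f->front;
    size_t len;

    assert(m != NULL);      /* a real get() would block instead */
    len = m->len;
    memcpy(dst, m->data, len);
    f->front = m->next;
    if (f->front == NULL) f->back = NULL;
    free(m->data);
    free(m);
    return len;
}
```

Since the bytes are copied at the time of the put, the sender may reuse its
variable immediately. The queue itself carries no type information, so an
int and a double can follow each other in the same fifo and the receiver
must know which to expect.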

3.3.3. Waitf()

Waitf() takes two parameters. The first is an array of character strings
(i.e. pointers to character strings), each of which is the name of an import
fifo created by that task. The second parameter is an integer specifying the
length of this array. When executed, the simulator checks each of the
specified import fifo's, and if all are empty, the task blocks. When a
message arrives in one of these fifo's, the task is restarted at the
instruction following the waitf(). If one or more of the specified fifo's is
not empty when the waitf() is executed, then the waitf() acts like a no-op.

The waitf() function allows a task to wait for data to arrive in any of a
set of import fifo's. When a message arrives and is placed into one of these
import fifo's, the task must determine which fifo the message arrived in.
Clearly, get() cannot be used for this purpose, since a get() on an empty
fifo will cause the task to block. Two routines which can perform this
function are size() and qlength(). These will be discussed next.

3.3.4. Size() and Qlength()

The size() function takes a single parameter specifying an import fifo
created by the task. When called, it returns the size (in bytes) of the next
message in this fifo. If the fifo is empty, size() returns 0. This routine is
useful for receiving variable length data.

Qlength() also takes a single parameter specifying the name of an import
fifo. This routine returns the number of messages currently in that fifo. If
the fifo is empty, qlength() returns 0. It, like size(), can be used for
polling fifo's to see if they have any data (see the discussion of the
waitf() routine above). For efficiency reasons, however, it is recommended
that the user use size() for this purpose.
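
The polling step that follows a waitf() amounts to a scan for the first
non-empty fifo. A minimal sketch is given below; first_ready() is a
hypothetical helper, and the array next_size stands for the values that
size() would return for each fifo in the set.

```c
#include <assert.h>
#include <stddef.h>

/* Given the size() result for each fifo waited on, return the index of
   the first fifo holding a message, or -1 if all are empty. */
int first_ready(const size_t *next_size, int n)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        if (next_size[i] != 0)
            return i;
    return -1;
}
```

A result of -1 corresponds to the case in which waitf() would block, and
any other result names a fifo on which get() can be called without waiting.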

3.3.5. Overfl()

Overfl() takes a single parameter specifying an import fifo created by the
task. Each import fifo has a flag associated with it which is set when a
message arrives in a bounded fifo (i.e. one whose length is limited to a
fixed number of messages) which is full, causing the oldest message in the
fifo (the front message) to be lost. Overfl() returns the value of this
flag, and causes the flag to be reset. Thus, it is similar to the overrun
bit in a UART.

4.  Examples


To clarify the ideas presented so far, this section presents several
examples of the use of the functions defined above. The first is a simple
use of export and import fifo's to transport an array from one task to
another. The second uses the size() and waitf() functions to implement a
"server" task. Finally, the third example demonstrates the creation of
multiple tasks from a single program to compute the dot product of two
vectors.


4.1.  Exporting  an  Array 

This example demonstrates how one can export an array of length N from one
task to another. Two tasks, T1 and T2, respectively send and receive the
array A. The sending task, T1, is defined as:


#define N 100                       /* size of array */

T1()                                /* task to send array */
{
    int A[N];                       /* array to be sent */

    export ("A");                   /* create export fifo A */
    getarray (A);                   /* routine to input array A */
    put ("A", A, N*sizeof(int));    /* send array */
}

getarray(A)                         /* routine to input array */
int A[];
{
    int i;

    for (i=0; i<N; i++) {           /* input array */
        printf ("enter data:");
        scanf ("%d", A + i);
    }
}


The  receiving  task,  T2,  is  defined  as: 


T2()                    /* task to receive array */
{
    int B[N];           /* array to hold result */

    import ("A", 0);    /* create import fifo A */
    get ("A", B);       /* get array */
}


The  bootstrap  routine  for  this  pair  of  tasks  is  defined  as: 

#include <stdio.h>

bootstrap()
{
    int T1(), T2();     /* names of the tasks */

    mktask ("T1", T1, 1, NULL, 0);
    mktask ("T2", T2, 2, NULL, 0);
}

4.2. A Server to Square Variables

A "server" is implemented which will square and return any floating point
value sent to it. It is assumed that each task which uses the server has an
export and an import fifo whose name is "A[i]", where i is some integer
assigned to that task. To use the server, a task sends a floating point
number on its export fifo, and waits for the result to arrive on its import
fifo.

In this example, the server uses the waitf() routine to wait for data to
arrive in one of N fifo's. It then uses size() to poll the fifo's to
determine which one received the data. One could have used qlength() instead
of size(); however, size() is executed more efficiently by the simulator.
Thus, size() is recommended when testing to see if a fifo is empty.

The server first sets up an array of character strings (actually, an array
of pointers to character strings) which lists the set of fifo's it will
waitf() on. This is followed by an infinite loop which consists of two
actions:

(1)  wait for data to arrive in one of its fifo's

(2)  poll its fifo's, and return a result for any data that has arrived.

Thus,  the  server  task  SQUARE  is  defined  as: 

#include <stdio.h>
#define N 10

SQUARE()
{
    int i;
    char *p[N];                     /* pointers to fifo names */
    char *cnvarrsv(), *cnvarr();
    float x;

    exparr ("A", N);                /* create arrays of fifo's */
    imparr ("A", N, 0);

    /* set up array of fifo names for waitf() procedure */
    for (i=0; i<N; i++)
        p[i] = cnvarrsv ("A", i);   /* fifo name */

    /* At this point, p[0] points to the character string
       "A[0]", p[1] to "A[1]", etc. */

    for ( ; ; ) {                   /* loop forever */
        waitf (p, N);               /* wait for data */

        /* Execute the following when data arrives.
           Find fifo with data and return result. */
        for (i=0 ; i<N ; i++)
            if (size (cnvarr("A", i)) != 0) {
                get (cnvarr("A", i), &x);
                x *= x;
                put (cnvarr("A", i), &x,
                     sizeof(float));
            }
    }                               /* Infinite loop */
}


Note that cnvarrsv() is used instead of cnvarr() whenever the value returned
is assigned to a variable. A task using the server might be defined as (in
file USER.c):

USER(x)
char x[];
{
    float z;

    export (x);
    import (x, 0);

    put (x, &z, sizeof (float));
    get (x, &z);
}


The bootstrap routine must pass a parameter to the user task specifying the
name of the fifo it is to use. Suppose we want to create N tasks from the
USER program defined above. To do this, bootstrap() might look like:


bootstrap()
{
    int i, id, strlen();
    char *cnvarr();

    id = 1;                     /* id of created tasks */

    for (i=0; i<N; i++)
        mktask ("USER", USER, id++, cnvarr ("A", i),
                strlen(cnvarr ("A", i)) + 1);

    mktask ("SQUARE", SQUARE, id++, NULL, 0);
}


The above code uses the predefined C routine strlen() to get the length of
the fifo name string (in bytes). A one is added to the value returned by
strlen() because of the C convention of appending a null (\0) character to
mark the end of character strings.
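
The effect of the extra byte can be checked with standard C alone. In the
sketch below, param_bytes() and copy_is_terminated() are hypothetical
helpers written for this illustration; the copy is done with memcpy() on
the assumption that a parameter block is copied as a plain run of bytes.

```c
#include <assert.h>
#include <string.h>

/* Number of bytes needed to pass string `s` as a parameter block,
   including the terminating null character. */
size_t param_bytes(const char *s)
{
    return strlen(s) + 1;
}

/* Copy `nbytes` bytes of `s` into a scratch buffer the way a
   byte-count-based copy would. Returns 1 if the copied bytes include
   a terminating null character, 0 otherwise. */
int copy_is_terminated(const char *s, size_t nbytes)
{
    char dst[32];

    assert(nbytes <= sizeof dst);
    memset(dst, '#', sizeof dst);       /* poison the destination */
    memcpy(dst, s, nbytes);
    return memchr(dst, '\0', nbytes) != NULL;
}
```

For the name "A[3]", strlen() reports 4 bytes, and a 4-byte copy leaves the
receiver with no terminator. Copying 5 bytes carries the null character
along.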

Finally, note that even though many tasks are created from the USER program,
the init() routine need not be used (SQUARE does not use init() because
multiple  instances  are  not  created).  This  is  because  USER  does  not  have  any 
global  or  static  variables.  The  next  example  will  demonstrate  the  use  of  init(). 

4.2.1.  Dot  Product 

This example computes the dot product of a pair of vectors, A and B. Three
programs are used. The first (called T) multiplies two array elements together.
For arrays of length N, there are N tasks generated from this program. T uses
global variables so that use of the init() routine can be demonstrated. The
second, called TDISTR, distributes the data to these N tasks, and the third,
TRESULT, collects the results and adds them up to form the dot product in the
variable sum. One task is created for each of TDISTR and TRESULT.

The parameter and global data structures are defined first:

struct parm             /* format of parameters */
{
    char x[M], y[M];    /* input fifo names */
};

struct                  /* globals for task */
{
    float a, b, c;      /* work variables */
} glob;

Task T is defined as follows:

T (p)    /* multiply numbers */
struct parm *p;    /* x and y are parameters
                      specifying import fifo's */
{
    init (&glob, sizeof (glob));

    import (p->x, 0);    /* create fifo's */
    import (p->y, 0);
    export ("C");    /* result fifo */

    get (p->x, &glob.a);    /* get operands */
    get (p->y, &glob.b);

    glob.c = glob.a * glob.b;

    put ("C", &glob.c, sizeof (float));    /* result */
}

#define N 100
TDISTR ()
{
    float A[N], B[N];    /* vector operands */
    int i;

    exparr ("A", N);    /* arrays of fifo's */
    exparr ("B", N);

    «input data into A[] and B[]»

    for (i=0 ; i<N ; i++) {    /* loop to send data */
        put (cnvarr ("A", i), A+i, sizeof(float));
        put (cnvarr ("B", i), B+i, sizeof(float));
    }
}


Note  that  the  put()  routine  takes  a  pointer  to  the  data  as  an  argument,  and 
not the data itself. Thus, "A+i" and "B+i" are used rather than "A[i]" and "B[i]".
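The pointer convention can be checked with a small sketch. It only illustrates that "A+i" is the address of "A[i]"; the helper copy_out() is a hypothetical stand-in for the way put() is described as reading its data argument, and is not part of the simulator.

```c
#include <assert.h>
#include <string.h>

#define N 4

/* Hypothetical stand-in for the data argument of put(): copies
   nbytes from the address it is given and returns the value. */
float copy_out(const void *data, size_t nbytes)
{
    float out = 0.0f;
    assert(nbytes == sizeof(float));
    memcpy(&out, data, nbytes);
    return out;
}

/* Returns 1 if A+i and &A[i] name the same element for every i,
   and copying through A+i yields the value A[i]. */
int same_address(void)
{
    float A[N] = {1.0f, 2.0f, 3.0f, 4.0f};
    int i;

    for (i = 0; i < N; i++)
        if (A + i != &A[i] || copy_out(A + i, sizeof(float)) != A[i])
            return 0;
    return 1;
}
```

Passing "A[i]" instead would hand the routine a float value where an address is expected, which is why the listing uses "A+i".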

And finally, the task which collects the multiplied operands and generates
the result, TRESULT, is:

TRESULT ()
{
    float sum, temp;
    int i;

    import ("C", 0);    /* import fifo's */

    sum = 0.0;    /* add results from T */

    for (i=0 ; i<N ; i++) {
        get ("C", &temp);
        sum += temp;
    }
}

Note that all the values of A[i] * B[i] collect in fifo "C" of TRESULT. The
order  in  which  the  data  arrives  is  arbitrary,  but  this  is  of  no  consequence  here. 

The bootstrap() routine for dot product would look like:


#include <stdio.h>
bootstrap()
{
    int i, id;
    int T(), TDISTR(), TRESULT();
    char *cnvarr();
    struct parm p;

    id = 1;

    /* create tasks from program T */

    for (i=0; i<N; i++) {
        strcpy (p.x, cnvarr ("A", i));
        strcpy (p.y, cnvarr ("B", i));
        mktask (cnvarr("T", i), T, id++,
                &p, sizeof (p));
    }

    /* create other tasks */

    mktask ("TDISTR", TDISTR, id++, NULL, 0);
    mktask ("TRESULT", TRESULT, id++, NULL, 0);
}

Note  that  the  parameters  passed  to  the  tasks  in  the  calls  to  mktask()  do 
not  contain  pointers.  Storage  for  the  strings  x  and  y  is  allocated  within  the 
parm  structure  itself,  rather  than  using  pointers  to  character  strings.  The 
predefined routine strcpy() is used to copy characters into the parameter list.
Finally, notice the use of cnvarr() in mktask() to create different names for the
tasks  defined  using  the  program  T. 


5. TIMING MODEL

A timing model is clearly necessary if the simulator is to be used for any
kind of performance evaluation, whether of communications networks or execution
of parallel algorithms. This section will describe the timing model provided
by the simulator as well as a high level view of how it works.

Each processor has a clock which keeps track of time for the task running
on that processor (recall that each processor is assumed to execute at most one
task). In the real multiprocessor, all of the clocks would advance simultaneously
with real time, and thus would (at least ideally) indicate the same time.
Here, however, execution of tasks is time multiplexed on the host computer
running the simulator. Thus, at any moment in the simulation run, clocks on
different tasks will usually differ. The clock for each task will reflect how far
that task has progressed in its computations.

When the simulator begins executing a task, the clock for that task is
started. At some later time, when execution is switched to another task, the
original task's clock is stopped. A task's clock also advances if the task becomes
idle, e.g. when the task must wait for data to arrive from some other processor.
All tasks initially begin execution with their clocks set at "0".

Using  this  clock,  the  simulator  can  estimate  when  events  occur  in  the  real 
system,  such  as  when  tasks  send  messages,  when  messages  arrive  (actually,  the 
switch  model  determines  this),  etc.  The  simulator  also  uses  the  timing  model  to 
accumulate  statistics  on  the  task,  such  as  how  long  it  spent  executing,  and  how 
long it was waiting for data. Such statistics are printed out at the end of each
simulation  run. 

Implementation of the timing model requires that the user compile his programs
with a modified version of the C compiler (cc). Let us call this modified
version simcc. Simcc is like cc except that in addition to generating VAX
assembler instructions for the code being compiled (which are automatically piped
into the assembler program), it inserts additional instructions which will
increment the clock for the task as it executes. Thus, at least conceptually,
execution of each assembler instruction generated by cc is followed by the
execution of an inserted instruction which causes the clock for that task to advance
by an amount of time equal to the execution time of that instruction. An option
to simcc allows the user to assign an execution time to each VAX instruction
(i.e. each opcode). Use of this option is described in Appendix IV. The current
implementation assigns a default execution time of one microsecond to each
instruction if no such assignment is made.

At present, there is one routine available to the user pertaining to the timing
model. This is the getclk() routine, which allows a task to read its own
clock. Getclk() takes a single parameter, the id of the executing task. It
returns  an  integer  which  is  the  number  of  microseconds  that  have  elapsed  since 
the  task  was  created  (recall  that  all  tasks  are  assumed  to  be  created  at  time  0). 
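The per-instruction accounting described in this section can be pictured with a minimal sketch. It is not the code simcc inserts; the table size, the opcode numbering, and the names set_cost() and tick() are assumptions made for illustration. Only the one microsecond default comes from the text.

```c
#include <assert.h>

#define NOPCODES 256          /* assumed size of the opcode table */
#define DEFAULT_COST_US 1     /* default from the text: one microsecond */

static long cost_us[NOPCODES];    /* 0 means "no assignment made" */

/* Assign an execution time, in microseconds, to one opcode. */
void set_cost(int opcode, long us)
{
    assert(opcode >= 0 && opcode < NOPCODES && us > 0);
    cost_us[opcode] = us;
}

/* Return the task clock after executing one instruction with the
   given opcode, using the default cost if none was assigned. */
long tick(long clock_us, int opcode)
{
    assert(opcode >= 0 && opcode < NOPCODES);
    return clock_us + (cost_us[opcode] ? cost_us[opcode] : DEFAULT_COST_US);
}
```

Under these assumptions, a task that executes one unassigned opcode and one opcode assigned five microseconds would read six microseconds on its clock.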


6. THINGS TO LOOK OUT FOR


Unfortunately,  the  simulator  could  not  be  designed  in  a  manner  that  would 
detect  all  possible  errors  the  user  might  commit.  Because  of  this,  some  rather 
subtle bugs may arise. This section outlines some "easily made" mistakes the
user  might  encounter  in  writing  programs  for  the  simulator. 

6.1. Global and Static Variables

As  pointed  out  earlier,  if  multiple  instances  of  a  task  are  created,  and  if  the 
program  for  these  tasks  uses  global  or  static  variables,  the  init()  function  MUST 
be  used.  Since  the  simulator  cannot  determine  which  tasks  fall  under  this 
category,  it  cannot  detect  those  which  neglect  this  detail. 

Tasks neglecting to use init() when they should may find that variables
mysteriously change without being assigned to! This is because the simulator will
not save or restore the values of these variables when execution switches from
one task to another. Thus, in effect, all of the tasks created from that program
use the same set of variables, allowing one task to change another's. Needless
to say, the results are generally unpredictable.
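The effect can be pictured with a small sketch. It assumes that registering a global area lets the simulator copy that area out and back in on every task switch; the text does not describe the actual mechanism, and the names saved[] and switch_task() are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NTASKS 2

struct { float a; } glob;                 /* one copy shared by all tasks */
static char saved[NTASKS][sizeof glob];   /* assumed per-task save areas */

/* Switch from task `from` to task `to`: save the outgoing task's
   globals and restore the incoming task's globals. */
void switch_task(int from, int to)
{
    assert(from >= 0 && from < NTASKS && to >= 0 && to < NTASKS);
    memcpy(saved[from], &glob, sizeof glob);
    memcpy(&glob, saved[to], sizeof glob);
}
```

Without the two copies in switch_task(), a value written to glob.a by one task would still be there when the next task ran, which is the "mysterious" change described above.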

6.2. Use of Cnvarr()

The  user  should  never  assign  the  value  returned  by  cnvarr()  to  a  variable. 
Subsequent  calls  to  cnvarr()  will  change  the  value  of  this  variable,  again  without 
the  use  of  an  assignment  statement.  As  noted  earlier,  this  is  because  the  result 
generated  by  cnvarr()  is  always  stored  in  a  single  buffer  (actually,  each  task  has 
its  own  buffer).  Thus,  each  call  overwrites  the  value  generated  by  the  previous 
call. 

If  the  value  is  to  be  assigned  to  a  variable,  cnvarrsv()  should  be  used.  The 
result  is  then  stored  in  a  data  area  created  via  the  calloc()  function.  It  is  left  to 
the user to release this storage (via cfree()) when he is done using it.
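The difference between the two routines can be sketched with stand-ins. The functions demo_cnvarr() and demo_cnvarrsv() below are illustrative only and are not the simulator's routines; the buffer size is an assumption, and the sketch uses the standard free() where the text names cfree().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NAMELEN 32    /* assumed maximum length of a fifo name */

/* Stand-in for cnvarr(): formats "name[i]" into one shared buffer,
   so every call overwrites the previous result. */
char *demo_cnvarr(const char *name, int i)
{
    static char buf[NAMELEN];
    snprintf(buf, sizeof buf, "%s[%d]", name, i);
    return buf;
}

/* Stand-in for cnvarrsv(): returns a private copy in storage from
   calloc(); the caller is responsible for releasing it. */
char *demo_cnvarrsv(const char *name, int i)
{
    const char *s = demo_cnvarr(name, i);
    char *copy = calloc(strlen(s) + 1, 1);
    assert(copy != NULL);
    strcpy(copy, s);
    return copy;
}
```

A pointer kept from demo_cnvarr("A", 0) reads "A[1]" after a later call with index 1, while a pointer from demo_cnvarrsv("A", 0) still reads "A[0]".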


6.3.  Variable  Sharing 

Two  tasks  should  never  access  the  same  variables.  Doing  so  effectively 
allows  tasks  to  communicate  without  using  the  switch  model,  thus  invalidating 
any  statistics  gathered  during  the  simulation  run. 

6.4.  Fifo  Names 

Blanks embedded within fifo names are significant. Thus, "T ", " T", and "T"
are three distinct fifo's. For example, a message placed in export fifo "T" will
not be sent to import fifo's " T" or "T ".


7.  IMPLEMENTATION 


This  section  describes  the  implementation  of  Simon.  It  is  intended  for  a 
programmer  who  is  planning  to  modify  or  augment  Simon,  or  for  someone 
interested in developing new switch models. A high level description of the
operation of Simon is given, as well as details of the manner in which the
program  is  physically  partitioned  into  modules  (i.e.  files)  and  the  interfaces  and 
interactions  between  modules.  Readers  interested  in  low  level  details  (e.g.  the 
order  and  types  of  parameters  in  subroutine  calls)  should  refer  directly  to  the 
source  code  for  Simon.  It  is  assumed  that  the  reader  is  familiar  with  the  user's 
interface  to  Simon  (seen  by  the  applications  programmer)  described  earlier.  A 
general knowledge of the internal workings of compilers, operating systems, and
event  driven  simulation  programs  would  also  be  helpful,  though  not  strictly 
necessary. 

At the highest level of abstraction, Simon models the multicomputer system
as a sequence of state transitions. In Simon, internal variables keep track
of the current state of the real system at any instant in time. Transitions are
modeled as events. Just as the real system moves from state to state following
certain well-defined state transitions, Simon's internal variables move from
value to value following certain well-defined events. The model used by Simon to
represent the state of the multicomputer system, and the events modeling
transitions between states, are defined here.

Simon is implemented as a discrete time, event driven simulation program.
It is discrete time (in contrast to continuous time) because the set of times at
which state transitions, i.e. events, can occur is a countable (and finite because
the precision of the host computer is limited) set. It is event driven because
Simon executes by processing events, and stops when there are no more events
to process. Each event is timestamped to indicate the time at which the event
occurs in the real system. Events are processed in time increasing order, and
generally cause some modification in the state of the system and generate, or
schedule, additional events. Thus the simulator always has a backlog of events
waiting to be processed. It continually tries to clear this backlog by processing
events, one after another; however, in doing so, it keeps adding to the backlog!
When the backlog is finally cleared, the simulation run is complete.

The  operation  of  Simon  consists  of  setting  up  some  initial  state,  and  then 
processing  events  to  model  the  state  transitions  of  the  real  system.  Roughly 
speaking,  Simon  is  composed  of  the  following  components: 

(1)  Events. 

(2)  Tasks. 

(3)  Fifos  and  Virtual  Circuits. 

(4)  Messages. 

(5)  Timing  Model. 

A high level understanding of the internal workings of Simon can be obtained by
understanding the events defined by Simon, and how they modify the state of
the system. These events will be discussed first. The remaining components
model the state of the multicomputer system, and will be discussed in
subsequent sections.

First let us say something about the overall organization of the Simon program.
The source code is distributed across several files, with each file containing
the code for performing some logical function. The code for managing tasks,
for example, resides in the file "task.c". The different files and the functions
they perform are listed in Appendix II. Interactions between files are, with only
a few exceptions, through procedure calls. There are very few global variables.
Although this results in some performance penalty since procedure calls are
more expensive than accesses to global variables, it improves the modularity
and readability of the resulting code. The emphasis in developing Simon was
towards modularity to promote readability and to reduce the difficulty of making
changes to the code. Modifications to improve performance had not been
attempted at the time this document was prepared, but this is certainly an area
of future work.

Throughout this document, the convention will be used that when the name
of a routine is referred to, the file in which it resides will follow in bold-face
print. For example, save() (task.c) refers to the save routine, which resides in
the file task.c. Routines residing in the switch model will be assumed to reside
in a file called swmod.c (the simplest switch model), unless specified otherwise.

7.1.  Events 

State  transitions  in  the  real  system  are  modeled  in  Simon  by  events.  A  data 
structure  called  the  event  queue  keeps  a  time  sorted  list  of  events  waiting  to  be 
processed.  The  main()  (mainsim.c)  program  of  Simon  repeatedly  executes  the 
following  actions: 

(1)  Take  the  next  event  off  of  the  event  queue. 

(2) Process the event by modifying some of Simon's internal variables and/or
    scheduling other events.

Execution  terminates  when  the  event  queue  becomes  empty. 
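The two-step loop above can be sketched as follows. This is a minimal illustration, not Simon's code: the event structure, the linked-list queue, and the names demo_schedule() and run() are assumptions, and the handler call is replaced by recording the timestamp.

```c
#include <assert.h>
#include <stdlib.h>

/* Event kinds named in the text; the struct layout is an assumption. */
enum kind { INITTASK, SEND, ARRIVE, GET, INITSW };

struct event {
    long time;             /* timestamp in simulated microseconds */
    enum kind kind;
    struct event *next;
};

static struct event *queue;    /* head holds the earliest event */

/* Insert an event so the list stays sorted by increasing time.
   Events with equal timestamps keep their insertion order. */
void demo_schedule(long time, enum kind kind)
{
    struct event **pp = &queue;
    struct event *e = malloc(sizeof *e);

    assert(e != NULL);
    e->time = time;
    e->kind = kind;
    while (*pp != NULL && (*pp)->time <= time)
        pp = &(*pp)->next;
    e->next = *pp;
    *pp = e;
}

/* Remove events in time order until the queue is empty or max
   events have been recorded. Stores each timestamp in out[] and
   returns the number of events processed. */
int run(long out[], int max)
{
    int n = 0;

    while (queue != NULL && n < max) {
        struct event *e = queue;
        queue = e->next;
        out[n++] = e->time;    /* a real handler would be called here */
        free(e);
    }
    return n;
}
```

Scheduling events at times 30, 10, and 20 and then calling run() processes them in the order 10, 20, 30, regardless of the order in which they were scheduled.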

The real multicomputer system consists of a number of autonomous computers,
each executing some part of a parallel application program. How does
one go about selecting the events for Simon? Removing an event from the event
queue, processing it, and then scheduling other events is a fairly time consuming
process, so it is unreasonable to go through this scenario each time the
application program modifies one of its variables. Instead, Simon allows each
task to execute, modifying its own variables. Each task executes directly on the
host computer. Thus, under this scheme, only one type of event is required: an
event which initiates execution of a task. When the simulation begins, the event
queue is initialized to hold one such event for each task defined by the
application program, and each task is executed to completion, one after the other. In
Simon, such an event which marks the initialization of a task is called an
"inittask event", and as just described, one is scheduled for each task created by
the "bootstrap" program as part of the initialization process.

After a little thought, however, it is easy to see that this simple scheme will
not work if tasks interact with each other, e.g. by exchanging messages. Clearly
a task cannot execute to completion if it uses data generated by another task
which has not yet begun execution. Thus, other events are necessary to
simulate interactions, and to force one task to temporarily stop executing while
other tasks are allowed to "catch up". Since all interactions between processors
occur via messages, let us define two new events:

(1) The "send event" denotes a processor sending a message.

(2) The "arrive event" denotes a message arriving at a processor.

The  send  event  is  generated  when  a  processor  executes  the  put()  (lib.c)  routine. 
Since  a  send  event  denotes  a  message  entering  the  communication  switch,  it 
must  be  processed  by  the  switch  model.  As  a  consequence  of  processing  the 
send  event,  the  switch  model  eventually  schedules  an  arrive  event  indicating 
that the message has reached its final destination.

Getting back to our original scheme, we see that the two additional events
described above are still not sufficient, since we still need a mechanism to
temporarily stop execution of tasks on the host computer so that others can
execute and generate the data they need. One more event, called the "get event",
is defined for this purpose. When a task needs data from another task, it calls a
routine  defined  by  Simon,  e.g.  the  get()  (lib.c)  routine.  Simon  temporarily  stops 
execution  of  the  task,  and  schedules  a  get  event  to  note  the  fact  that  this  task 
has  been  stopped.  Simon  then  goes  back  to  the  event  queue  and  processes 
other  events.  Eventually,  when  data  arrives  for  this  task,  it  will  be  restarted, 
and  allowed  to  continue  execution. 

Consider the sequence of events that occur when two tasks are created, A
and B. Suppose task A sends a single message to task B. Because of timing
variations, the scenario which will now be described is not the only one possible;
however, it illustrates the interactions of the events described above. Initially,
the event queue consists of two events, namely the inittask events for A and B.
Suppose that the inittask event for task B appears ahead of that for A in the
event queue. Task B begins execution, and requests to receive the message by
executing the get() (lib.c) routine. Simon schedules a get event, and blocks the
task. The next event is the inittask event for A, so A begins execution. Task A
sends the message by executing the put() (lib.c) routine. This causes a send
event to be scheduled into the event queue. Task A continues to execute, since
as will be seen later, there is no reason to stop it now. Let us assume that A
completes execution, returning control back to Simon. Returning to the event
queue, we see there are two events: the get event denoting B's request to
receive the message, and the send event denoting the message being sent. Let
us assume the send event precedes the get event in the queue. Since this event
signals a message entering the switch, the switch model is called upon to
simulate the transmission of the message. In doing so, the switch model can perform
arbitrarily complex (or arbitrarily simple) operations, perhaps scheduling some
of its own events. The final result is that it schedules an arrive event which
indicates that the message has reached its destination. Now the queue holds an
arrive event and B's get event. The two possible orderings of these events in the
queue correspond to the two situations which can occur in the real
multicomputer: either the message arrives before B asks for it, implying B does not have
to wait for it, or the message arrives after B asks for it and B must become idle,
waiting for data.

Consider  the  first  case.  Here,  the  arrive  event  is  ahead  of  the  get  event. 
Processing of this event merely corresponds to Simon noting that the message
has arrived; it places the message in the appropriate import fifo. The get event
is then processed. The message is removed from the fifo and passed to task B.
The  task  is  restarted,  and  completes  execution.  Since  the  event  queue  is  now 
empty,  the  simulation  is  complete. 

Consider the second case. Here, the get event precedes the arrive event,
indicating the task asked for the data before it had arrived. In the real
multicomputer, this corresponds to the task blocking, waiting for data to arrive.
Here, Simon marks the task as "blocked", and goes back to process the next
event in the event queue. The only remaining event is the arrive event. Simon
notices that task B is blocked, waiting for the arrive event, so it unblocks the
task, passes it the message, and allows it to continue execution. Again, B
completes execution, and the simulation is complete. It is important to distinguish
between the two types of "blocking" performed by Simon. In the first, the task
was blocked to allow another task to "catch up". This "blocking" is just an
artifact of the simulation technique, and does not correspond to a task being
blocked in the real system. The second blocking corresponds to a task waiting
for data in the real system. Note that in Simon, the first type of blocking is
marked by a get event in the event queue, while the second type of blocking does
not have any such events scheduled. The task is later restarted by an arrive
event.
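The two orderings discussed above can be sketched with a pair of handlers. This is an illustration under simplifying assumptions (one import fifo per task, represented only by a message count); the names on_get() and on_arrive() and the task structure are hypothetical and do not reflect Simon's internal data structures.

```c
#include <assert.h>

enum state { RUNNABLE, PAUSED, BLOCKED };

struct task {
    enum state state;
    int fifo_len;    /* messages waiting in the task's import fifo */
    int received;    /* messages delivered to the task so far */
};

/* get event: deliver a waiting message and restart the task, or
   mark the task blocked if nothing has arrived yet. */
void on_get(struct task *t)
{
    if (t->fifo_len > 0) {
        t->fifo_len--;
        t->received++;
        t->state = RUNNABLE;    /* message arrived before it was needed */
    } else {
        t->state = BLOCKED;     /* idle in the real system, waiting for data */
    }
}

/* arrive event: add the message to the import fifo, and if the
   task is blocked waiting for it, deliver it and restart the task. */
void on_arrive(struct task *t)
{
    t->fifo_len++;
    if (t->state == BLOCKED) {
        t->fifo_len--;
        t->received++;
        t->state = RUNNABLE;
    }
}
```

In both orderings the task ends up runnable with one message received; the difference is whether it passed through the BLOCKED state, which is the only one of the two pauses that corresponds to idle time in the real system.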

With the scheme described above, each task has its own clock which
indicates how much time has elapsed since the task began execution. It is
important to realize that at any given time in the simulation, different tasks will
usually have different values on their respective clocks. This points out another
important function of the event queue: namely to ensure that interactions
among tasks are simulated in the proper time sequence. Since some tasks'
clocks are ahead of others, it is important not to let a task get "too far" ahead.
For example, suppose tasks "A" and "B" each send a message to task "C" using
some fifo, say "foo". Clearly, in the real system, either the message from A will
arrive at C first, or the message from B will arrive first. Suppose the one from A
arrives first. Suppose B was allowed to execute, scheduling a send, and
eventually an arrive event for its message. Now suppose C executed. It executes a
get() (lib.c), and erroneously receives the message from B. Only later is it
discovered that A generated a message which arrives before the message from
B, and that the simulation has been compromised! Happily, the scenario
described above cannot occur in Simon. The simulation cannot be compromised
if it executes interactions between tasks in the proper time sequence. Since
all interactions go through the time sorted event queue, events residing in
the queue cannot be simulated in the wrong order. The only way events can be
simulated out of sequence is if an event is scheduled with a timestamp
preceding that of the event now being processed. However, this implies an event
causes another event which occurred earlier in time than the first, e.g. a message
arriving before it was sent! Since this cannot happen in the real system, it
cannot happen in the simulator, so long as events correspond to occurrences in the
real system.

From this perspective, the purpose of the event queue takes on a new light.
The event queue is no longer simply a "warehouse" of events waiting to be
processed. In addition, it acts as a sequencer which ensures that events modeling
the real system are processed in the correct time sequence. Thus, it is easy to
see when Simon must temporarily block a task, and when it may allow the task
to continue executing: when a task's behavior may be affected by an interaction
with another task, Simon must block the task and schedule an event on the
event queue. Otherwise, the task may be allowed to continue executing. Using
this rule, it is now clear that we do not need to block a task executing a put()
(lib.c), and similarly, we do not need to block a task executing a get if the
corresponding import fifo holds at least one message. The latter results from
the fact that Simon places data into each import fifo in correct time sequence
(since only arrive events cause additions to import fifo's, and all arrive events
are guaranteed to be sequenced correctly), so no new message will be placed in
front of another message already in the fifo.

Finally, one other event is defined. This is the "initsw" event. This event
occurs exactly once, at the beginning of every simulation run, and is used
to  initialize  the  switch  model. 

In  summary,  five  different  types  of  events  have  been  defined: 

(1)  inittask:  initiate  execution  of  a  task. 

(2)  send:  send  message  into  communication  network. 

(3)  arrive:  message  arrives  from  the  communication  network. 

(4)  get:  task  pauses  because  it  interacts  with  another  task. 

(5)  initsw:  initialize  switch  model. 

Each  of  these  will  now  be  described  in  greater  detail.  When  these  events  are 
removed  from  the  event  queue,  the  main  program  calls  a  particular  routine  to 
process  that  event.  The  name  of  this  routine  and  the  functions  it  performs  will 
be  discussed. 


7.1.1. The Inittask Event

Simon initially places an inittask event in the event queue for each task
created in the bootstrap() routine. This event is processed by the routine
inittaskev() (evhand.c). It calls the routine sttask() (mainsim.c), which sets some
flags indicating that the task is to begin execution. Control is then returned to
the main program. Note that sttask() does not begin execution of the task
directly, but only sets some flags representing a request to start the task. After
processing each event, the main program checks to see if the event it has just
finished processing requested to start a task. If so, it begins executing the task
via an ordinary procedure call. Restarting tasks (after being stopped because
of interactions with other tasks) also follows this same procedure, i.e. the event
handler sets flags to request the restart, and the task is later restarted by the
main program. By using this indirect scheme for starting/restarting the
execution of tasks, all tasks begin execution from the same point in the main
program. This ensures that the base of the runtime stack always occupies the same
memory location, and it simplifies the task switching operation because all
returns from the task (either to stop it temporarily or permanently) resume
execution at the same place. After a task stops execution, the next function
performed is the removal and processing of the next event in the event queue.
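The indirect start scheme can be sketched as a request flag that only the main loop acts on. The sketch is an assumption about the general shape, not Simon's code: request_start() stands in for the flag-setting done by sttask(), dispatch() stands in for the check made by the main program after each event, and a function pointer plays the role of the flags.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*taskfn)(void);

static taskfn pending;    /* start request left by an event handler */
int started;              /* counts task starts, for illustration only */

/* Stand-in for sttask(): records a request, does not run the task. */
void request_start(taskfn fn)
{
    pending = fn;
}

/* A trivial task body. */
void demo_task(void)
{
    started++;
}

/* Called by the main loop after an event handler returns: if a
   start was requested, call the task from this single place.
   Returns 1 if a task was run, 0 otherwise. */
int dispatch(void)
{
    taskfn fn = pending;

    if (fn == NULL)
        return 0;
    pending = NULL;
    fn();
    return 1;
}
```

Because every task is entered from dispatch(), every task starts at the same stack depth and every return from a task lands at the same place, which is the property the text relies on.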

7.1.2.  The  Send  Event 

The  send  event  is  scheduled  by  the  put()  and  puts()  (lib.c)  routines,  and  is 
processed by the sendev() (swmod.c) routine. The functions performed by this
routine  will  be  described  later  in  the  discussion  on  switch  models. 

7.1.3. The Arrive Event

The arrive event is scheduled directly, or indirectly through the sendev()
(swmod.c) routine, and is processed by arriveev() (evhand.c). The event handler
must add the message to the destination import fifo. Also, if a task is blocked,
waiting for this message to arrive, the sttask() (task.c) routine is called to
request that the task be restarted.


7.1.4. The Get Event

A get event may be scheduled by any routine which has an outcome
depending on some interaction with another task. Currently, the routines which fall
into this category are: get(), overfl(), qlength(), size(), and waitf() (lib.c). These
routines may or may not cause a get event to be scheduled. If it can be
determined that the result of this routine is not affected by anything any other task
can do, then a get event need not be scheduled. For example, if the get()
routine is executed on a fifo which already holds at least one message, then this
routine will always return the first message in the fifo, regardless of any other
messages that may arrive. The get(), size(), and waitf() routines do not schedule
a get event if the import fifo specified by the routine (or in the case of waitf(), at
least one fifo) has a message. Otherwise, a get event is scheduled. The overfl()
and qlength() routines always schedule a get event because they must wait until
all other tasks have "caught up" with the current one in order to determine the
result returned by the routine.
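The scheduling rule in this paragraph can be written as a single decision function. The enumeration and the name needs_get_event() are hypothetical; only the rule itself is taken from the text.

```c
#include <assert.h>

/* The five routines the text lists as possible sources of a get event. */
enum routine { R_GET, R_SIZE, R_WAITF, R_OVERFL, R_QLENGTH };

/* Returns 1 if a get event must be scheduled, 0 if the task may
   continue. `has_message` is nonzero when the import fifo named in
   the call (for waitf, at least one of the fifos) holds a message. */
int needs_get_event(enum routine r, int has_message)
{
    switch (r) {
    case R_GET:
    case R_SIZE:
    case R_WAITF:
        return has_message ? 0 : 1;
    case R_OVERFL:
    case R_QLENGTH:
        return 1;    /* always wait for the other tasks to catch up */
    }
    assert(!"unknown routine");
    return 1;
}
```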

Let us now consider the processing of the get event when it appears at the
head of the event queue. Five different types of get events exist, distinguished
by the routine which caused the event. All types are processed by the getev()
(evhand.c) routine. If the event was scheduled by the get() routine, the task is
restarted if the fifo the get was performed on now has a message. The restart
request is performed by calling sttask() (task.c), as described above. In this
case, the message arrived before the task needed it. Otherwise, the task is
marked "blocked" via the blktsk() (task.c) routine, signifying that the task is
idle, waiting for data in the real system. The "waitf" get event is handled in a
similar fashion. The remaining get event types all cause the task to be
restarted, since they only ask for information which can now be determined with
complete certainty (e.g. the number of messages in a fifo).

7.1.5. Other Event Related Routines

A number of other routines are defined relating to the event queue. These
routines are defined in the file evnt.c. Each type of event has a separate
routine for scheduling an event, e.g. the scsend() routine schedules a send
event. All call the schedule() routine, which inserts the event into the time
ordered event queue. Routines are provided to remove events from the event
queue, and to return parameters for the different event types. The initevq()
routine initializes the event queue, notevqempty() checks to see if the queue is
empty, and the prevq() routine prints the contents of the event queue onto the
standard output. This last routine is provided for debugging purposes. Finally,
a number of other routines are provided for internal storage management
purposes.
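
As a minimal sketch of what inserting into a time ordered queue involves, the
fragment below keeps a singly linked list sorted by timestamp. The struct
layout and the name ev_insert() are assumptions made for illustration; the
report does not show how schedule() is actually written. Events with equal
times are kept in insertion order, which is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event record: only the fields needed for ordering. */
struct event {
    unsigned long time;     /* simulated time at which the event occurs */
    int           type;     /* e.g. send, arrive, get */
    struct event *next;
};

/* Insert ev so that the list stays sorted by time; returns the new head. */
struct event *ev_insert(struct event *head, struct event *ev)
{
    struct event **pp = &head;

    assert(ev != NULL);
    /* Skip every event that is not later than ev (ties stay in FIFO order). */
    while (*pp != NULL && (*pp)->time <= ev->time)
        pp = &(*pp)->next;
    ev->next = *pp;
    *pp = ev;
    return head;
}
```

The event at the head of the list is then always the next one to process.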

7.2.  Tasks 

Simon maintains information on each task to allow the tasks to time multiplex
their execution on the host computer, and to collect statistics. In many
respects, the management of tasks on Simon resembles the management of
processes in a uniprocessor operating system. First, Simon must be able to
create and begin the execution of a task when an inittask event occurs. When a
get event occurs, the task must stop execution, and another task must be
restarted. Finally, when the task completes execution, it must be destroyed.

7.2.1.  Starting  Tasks 

Creating and beginning the execution of a task is a relatively
straightforward operation. When the user-defined bootstrap() routine creates a
task by executing the mktask() (lib.c) routine, Simon creates and initializes a
data structure for the task called a "task control block", or tcb. The
information maintained in the tcb includes the name of the task (a user defined
character string), the status of the task (e.g. running, or blocked and waiting
for data; if the task is blocked, then this means it is blocked in the real
system), pointers to saved state information, a pointer to the code for the
task, and various statistics on the execution of the task.

The mktask() (lib.c) routine specifies, among other things, an integer called
the "id" of the task being created. A task's id is unique, and is used internally
by Simon to refer to the task. Since tasks cannot be created dynamically, task
ids are never reused. Once the tcb is set up, the task is "officially" created.


After mktask() creates the tcb by calling taskcr() (task.c), an inittask event
is scheduled so that the task will eventually begin execution. When the inittask
event is processed, the task begins execution through an ordinary procedure
call. The inittaskev() (evhand.c) routine handles the event. The mechanics of
performing this function were described earlier in the section on inittask
events.

7.2.2. Stopping / Restarting Tasks

Stopping and restarting the execution of a task requires some fancy
"footwork" on the part of Simon. This is because a certain amount of information
must be saved when the task stops executing, and restored when the task is
restarted. The information which must be handled this way depends on the host
computer, and the program used to compile application programs. Here, we will
restrict our discussion to the C compiler (used at Berkeley) and the VAX host
computer. In the discussion which follows, it is important that the reader
distinguish between a task defined by Simon, and a process running under the
host computer. Simon may contain several tasks; to the host computer and
operating system, however, it is run as a single user process.

A C program executing on the VAX uses the VAX's runtime stack to hold
local variables, as well as information indicating the dynamic structure of
procedure calls (so the correct procedure can be executed when the current one
returns). A program executing as a single process uses a single runtime stack.
Thus, the various tasks running under Simon all share the same runtime stack.
When a task stops execution, the current runtime stack must be saved, since
this information will be overwritten by the next task. Similarly, when a task is
restarted, its runtime stack must be restored before execution can be
resumed.

In addition to the runtime stack, other information must be saved/restored
if multiple instances of a task are created from the same program. Besides
local variables, each task has a number of global variables. The C compiler
assigns each global variable to a private memory location. As long as only one
instance of a procedure is created, the memory locations assigned to the globals
remain unshared, so it is not necessary to save/restore them when tasks are
stopped/restarted. If several tasks are created from the same program, however,
all will use the same memory locations for their global variables, so it is
necessary to save/restore them. Unfortunately, Simon cannot easily determine
where these variables are located. Thus, it is up to the user (via the init()
(lib.c) routine) to tell Simon where the global variables are located if several
tasks are created from the same program. It might be noted that dynamic
variables created by e.g. the calloc() routine need not be saved/restored, even
if multiple instances of a task are created. This is because a private copy is
created for each task after the task begins execution. Since these memory
locations are not shared by other tasks, no saving/restoring is necessary.
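
The saving and restoring of shared globals can be pictured as copying a block
of bytes out of, and back into, the address range the user registered. The
sketch below assumes a per-task record holding that range and a private copy;
the names globals_area, ga_init(), ga_save(), and ga_restore() are invented for
this illustration and are not Simon routines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-task record of the registered global area. */
struct globals_area {
    char *addr;     /* where the shared globals live ("glob") */
    int   lngth;    /* number of bytes they occupy */
    char *saved;    /* this task's private copy */
};

/* Register the area for one task and take an initial snapshot. */
int ga_init(struct globals_area *g, char *glob, int lngth)
{
    assert(g != NULL && glob != NULL && lngth > 0);
    g->addr = glob;
    g->lngth = lngth;
    g->saved = malloc((size_t)lngth);
    if (g->saved == NULL)
        return -1;
    memcpy(g->saved, glob, (size_t)lngth);
    return 0;
}

/* Called when the task stops: keep its view of the globals. */
void ga_save(struct globals_area *g)
{
    memcpy(g->saved, g->addr, (size_t)g->lngth);
}

/* Called when the task restarts: put its view back in place. */
void ga_restore(const struct globals_area *g)
{
    memcpy(g->addr, g->saved, (size_t)g->lngth);
}
```

With two tasks registered on the same variable, each sees only the values it
wrote itself, provided ga_save() and ga_restore() bracket every switch.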

Thus, stopping the execution of a task causes the task's runtime stack, and
perhaps its global variables, to be saved in the task's tcb (actually a pointer
to the saved area is kept in the tcb). This state must be restored before the
task can be restarted. Let us consider the situation in which a task does a get
on an empty fifo, and must temporarily stop executing. The get() (lib.c) routine
schedules a get event, and then calls the save() (task.c) routine. Save() saves
the runtime stack, and global variables if necessary. Now, we would like to
return to the main program at the point where it starts/restarts tasks, so that
we can go on to process the next event in the event queue. Save() cannot simply
execute a return, however, since this would take us back into the get() routine.
In order to accomplish this, the return address information in the runtime
stack is overwritten, so that instead of returning to save(), we pop off all of
the stack frames for the task, and return to the main program (recall that all
tasks are started/restarted from the same point in the main program). The
resume() (task.c) routine performs this covert deed. When it returns, the main
program is now executing, rather than the routine which called it, save().

When it is time to resume execution of the task, the restore() (task.c)
routine is called. All starting/restarting of tasks is performed by this
routine. To restart the task which did the get(), the saved runtime stack is
loaded back into the real runtime stack, and globals are restored if necessary.
This is a dangerous business, because in restoring the stack, the stack frame of
the restore() routine is overwritten, destroying its local variables!
Nevertheless, a return is now executed. Since the stack now reflects the state
of the stack at the time at which save() was saving the stack, the return
executes as if save() did the return, i.e. we are now suddenly back in the get()
routine at the point just after the call to save(). From the viewpoint of the
task executing the get() routine, it simply called save(), and sometime later
execution returned from save(). Thus, the save() routine acts as a no-op to the
program which executes it. Of course, in the time between the call to save() and
its return, thousands of other events may have been processed.
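
The control flow described above (a call that looks like a no-op to the task
while the main program runs in between) can be reproduced on a modern system
without copying the stack by hand. The sketch below substitutes a different
technique: the POSIX <ucontext.h> routines getcontext(), makecontext(), and
swapcontext(), with a separate stack for the task. It is not how task.c works
on the VAX, and every name in it (task_yield(), run_demo(), and so on) is
hypothetical.

```c
#include <assert.h>
#include <ucontext.h>

ucontext_t main_ctx;            /* the simulator's main loop */
ucontext_t task_ctx;            /* the single demonstration task */
char task_stack[64 * 1024];     /* private stack for the task */
int order[8];                   /* records the order in which steps run */
int norder;

void note(int step)
{
    assert(norder < 8);
    order[norder++] = step;
}

/* Plays the role of save() plus resume(): suspend the task, go to main. */
void task_yield(void)
{
    swapcontext(&task_ctx, &main_ctx);
}

void task_body(void)
{
    note(1);
    task_yield();   /* to the task this is an ordinary call that returns */
    note(3);
}

/* Returns the number of recorded steps, or -1 if the context setup fails. */
int run_demo(void)
{
    norder = 0;
    if (getcontext(&task_ctx) != 0)
        return -1;
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;   /* where to go when task_body() returns */
    makecontext(&task_ctx, task_body, 0);

    if (swapcontext(&main_ctx, &task_ctx) != 0)   /* start the task */
        return -1;
    note(2);                        /* other events are processed meanwhile */
    if (swapcontext(&main_ctx, &task_ctx) != 0)   /* role of restore() */
        return -1;
    note(4);
    return norder;
}
```

The recorded order shows the main program running between the two halves of
task_body(), which is the behavior the save()/restore() pair provides.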

From the description presented above, it is clear that the saving/restoring
of a task is a tricky business using certain "unconventional" coding practices.
The save(), resume(), and restore() routines should only be modified after a
thorough understanding of their workings has been achieved. Seemingly harmless
changes, e.g. altering the order in which variables are declared in the
resume() routine, can have disastrous consequences.

7.2.3.  Terminating  Execution  of  Tasks 

Terminating the execution of a task is straightforward. Excluding execution
errors which terminate the entire simulation run, a task may finish execution by
either returning from its main program, or by executing the stop() (lib.c)
routine. Returning from the task's main program causes the runtime stack to be
popped by normal procedure returns, so control is returned to Simon as a
consequence of returning from the procedure call which initially caused the task
to begin execution. Execution of the stop() (lib.c) routine causes the stack to
be popped in a manner similar to that used in saving the stack, except the
stack's contents are not saved since the task will not be restarted. In any
case, execution resumes as if a return were executed from the initial call which
began execution.

7.2.4.  Other  Task  Management  Routines 

The remaining routines for task management (task.c) provide various
bookkeeping functions, e.g. keeping track of which tasks are blocked on which
fifos, or provide information about tasks, e.g. the name of a task given its id.
Finally, the prtask() routine prints out summary information on the execution
of all tasks. This routine is normally called only once, at the end of the
simulation run.

7.3. Fifos and Virtual Circuits

The fifos represent the interface which carries messages between tasks and
the switch model. Each import/export fifo pair of the same name corresponds
to a virtual circuit. The fifo.c file contains routines to create import and
export fifos, to add (remove) elements to (from) fifos, to test the status of
fifos (full, empty, etc.), and to gather end-to-end traffic statistics. The
implementation of the fifo management routines is described fully in [JohnBl].


The file task.c creates the notion of tasks, and defines certain operations,
e.g. saving state, which can be performed on them. Each task is identified with a
unique id. Other modules need not know how the various functions are
performed within the task module, but only need to be concerned with the
interface it provides. Similarly, msg.c is a module defining how messages are
dealt with.

Each message is identified by a unique id, called the message id. Unlike
task ids, however, a message id remains unique only as long as the message
remains in existence. Once the message arrives at its final destination, the
message id may be reused. This is necessary for efficiency reasons, since unlike
tasks, messages come and go at a relatively high rate.

When one processor sends a message to another, a message must be
created, and placed into an export fifo (actually only the message id is placed
into the buffer since this is the interface provided by the message module). The
data portion of the message must be copied into a separate buffer. A simple
pointer to the task's variable cannot be used because the task might change this
variable before the message arrives at its destination, causing the new value to
be transmitted rather than the value which existed when the message was sent.


When the message reaches its destination, the message id is placed into an
import fifo. When the receiving task reads the message (via the get() (lib.c)
routine), the contents of the message are copied into one of the receiver's
variables, and the message is destroyed. The id is now free to be used by newly
created messages.

The simple scheme described above is complicated somewhat by the fact
that Simon allows broadcast, i.e. multiple destination, communications. If a
message is broadcast to several destinations, it is wasteful to create a
separate copy and message id for each destination, so only one copy is created.
A count is associated with the number of copies existing in the real system.
When the message is created, the count field of the created message is set to
equal the number of destination processors. When a copy of the message arrives
at its destination, the count is decremented. The message is destroyed and its
id released when the count becomes zero. Note that this scheme allows different
physical messages in the network to have the same message id; however, this
does not lead to any inconsistencies within Simon.
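
A minimal sketch of this counting scheme follows. The structure and the names
msg_make() and msg_arrive() are hypothetical, not the routines in msg.c, and
the sketch makes one simplifying assumption: the count is decremented at the
moment the data is copied out to a receiver. It also shows why the data is
copied at creation time: a later change to the sender's variable does not
affect what is delivered.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical message descriptor with a broadcast copy count. */
struct msgdesc {
    int   size;     /* size of the message in bytes */
    char *buf;      /* private copy of the sender's data */
    int   copies;   /* destinations which have not yet received it */
};

/* Create one shared message for ndest destinations; NULL if out of memory. */
struct msgdesc *msg_make(const char *data, int size, int ndest)
{
    struct msgdesc *m;

    assert(data != NULL && size > 0 && ndest > 0);
    m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    m->buf = malloc((size_t)size);
    if (m->buf == NULL) {
        free(m);
        return NULL;
    }
    memcpy(m->buf, data, (size_t)size);   /* snapshot taken at send time */
    m->size = size;
    m->copies = ndest;
    return m;
}

/* Deliver to one destination; frees the message after the last delivery.
 * Returns the number of destinations still outstanding. */
int msg_arrive(struct msgdesc *m, char *dest)
{
    int left;

    assert(m != NULL && dest != NULL && m->copies > 0);
    memcpy(dest, m->buf, (size_t)m->size);
    left = --m->copies;
    if (left == 0) {
        free(m->buf);
        free(m);
    }
    return left;
}
```

After the last call to msg_arrive() the descriptor no longer exists, which
corresponds to the message id being released for reuse.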

Each  message  currently  in  existence  has  a  message  descriptor  associated 
with  it.  This  descriptor  is  analogous  to  the  tcb  allocated  to  each  task.  Message 
descriptors  include  such  information  as  the  size  of  the  message,  location  of  its 
buffer,  time  at  which  it  was  created,  number  of  copies,  etc. 

Messages are created and their contents copied into a buffer through the
mkmsg() (msg.c) routine. Rmmsg() (msg.c) destroys messages, and cpmsg()
(msg.c) copies the contents of the buffer for the message into the receiver's
data area. Other routines are provided to extract information about a
particular message, e.g. its size.

7.5.  Timing  Model 

The timing model consists of two components. The first is the simcc
compiler, which inserts instructions into the application program to
continuously update a clock as the program executes. The second is the portion
of Simon responsible for maintaining the current clock of each task. The
implementation of these two components will now be discussed.

7.5.1. Simcc

First, the simcc program is a modified version of the cc program, the
front end for the C compiler. The cc program calls the compiler to compile the
program. The output generated by the C compiler is an assembly language
version of the program being compiled. The assembler then creates an object
file. The simcc program inserts a filter between the C compiler and the
assembler. This filter is called the ccsf (C Compiler Statistics Filter)
program. The output generated by ccsf is the assembly language program input to
it, with added instructions which increment a global variable called "clock_".
Conceptually, an add instruction could be inserted before each assembly language
instruction to increment clock_ by an amount equal to the time to execute the
instruction which follows. With a little thought, however, it is easy to see
that this scheme can be improved upon: a block of assignment statements in which
the only entry is into the first statement, and the only exit is from the last
statement, may be preceded by a single add instruction which increments the
clock by an amount equal to the execution time of the block of statements. This
latter scheme reduces the amount of overhead required to increment clock_.

This is the approach taken by ccsf. The ccsf program also allows the user to
specify the execution time of each VAX instruction, and computes the execution
time of a block of instructions by simply computing the arithmetic sum of the
execution times of the individual instructions. It turns out, however, that one
cannot arbitrarily insert add instructions into the instruction stream, since
doing so may cause the condition codes on the host computer to change. The
ccsf program, which is VAX specific, inserts instructions which first save the
condition codes on the runtime stack, then inserts an add instruction, and
finally inserts instructions which restore the condition codes. It is
interesting to note that the final instruction which restores the condition
codes is the "return from interrupt" instruction.
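
The per-block sum can be sketched as a table lookup followed by an addition.
The table type, the function name block_cost(), and the idea of passing a
default cost for unlisted mnemonics are assumptions made for illustration;
ccsf itself works on assembler text and is not shown in this report.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical entry: one instruction mnemonic and its execution time. */
struct optime {
    const char *mnemonic;
    unsigned    cost;       /* in the time unit chosen by the user */
};

/*
 * Sum the costs of the n instructions in one straight-line block.
 * Mnemonics missing from the table are charged the default cost dflt.
 */
unsigned block_cost(const char *const *instrs, int n,
                    const struct optime *table, int ntable, unsigned dflt)
{
    unsigned total = 0;
    int i, j;

    assert(n >= 0 && ntable >= 0);
    for (i = 0; i < n; i++) {
        unsigned c = dflt;
        for (j = 0; j < ntable; j++) {
            if (strcmp(instrs[i], table[j].mnemonic) == 0) {
                c = table[j].cost;
                break;
            }
        }
        total += c;
    }
    return total;
}
```

The value returned for a block is the constant that the single inserted add
instruction would add to clock_ when the block is entered.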

7.5.2. Timing Model Within Simon

The interface between Simon and the task's timing model (as set up by
simcc) is the variable clock_. When a task first begins execution, clock_ is set
to 0. As the task executes, clock_ is incremented via the inserted instructions.
When the task stops executing (e.g. because it must wait on a get), the value of
clock_ is saved in the task's tcb. Clock_ is restored to this value when the
task is restarted.

Two timing models exist. In order to avoid anomalous effects caused by
floating point arithmetic, clock_ is implemented as an unsigned integer (32 bits
on the VAX) variable. This means it can grow as large as approximately 4 billion
time units. If the time unit is nanoseconds, this allows each task's clock to run
up to approximately 4 seconds. With a microsecond time unit, clocks may reach
values exceeding 4,000 seconds. The nanosecond timing model is in the file
timen.c, while the microsecond model is in timeu.c. Only one of these two files
is used on each simulation run.
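
The ranges quoted above follow from the largest value a 32 bit unsigned
integer can hold, 2^32 - 1 = 4,294,967,295. The small helper below (the name
max_ticks() is made up for this illustration) makes the arithmetic explicit.

```c
#include <assert.h>

/* Largest value an unsigned clock of the given width (1..32 bits) can hold. */
unsigned long max_ticks(int bits)
{
    assert(bits > 0 && bits <= 32);
    return (unsigned long)((1ULL << bits) - 1ULL);
}
```

Dividing max_ticks(32) by 1,000,000,000 gives 4 whole seconds for a nanosecond
unit, and dividing by 1,000,000 gives 4,294 whole seconds for a microsecond
unit.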

Except for the switch model, the interpretation of time units is arbitrary,
i.e. since the only real time in Simon is that expired by the tasks, no
inconsistencies result. In the switch model, however, delays refer to some time
unit, so this module must know whether the time unit in which it is to compute
(say) a delay is microseconds or nanoseconds. To handle this situation, each
timing module provides a number of conversion routines to convert
microseconds/nanoseconds to whatever time unit the rest of the simulator is
using. This gives the switch models the flexibility to use whatever time unit
they prefer, while remaining consistent with the rest of the simulator in a
transparent fashion.
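
A conversion routine of this kind might look like the sketch below. The
function names, the enum, and the decision to round a nanosecond delay up to
the next whole microsecond are all assumptions; the report does not list the
actual routines in timen.c and timeu.c.

```c
#include <assert.h>

/* Hypothetical tag for the time unit the simulator was built with. */
enum unit { UNIT_NS, UNIT_US };

/* Convert a delay in nanoseconds to simulator ticks.  When ticks are
 * microseconds the result is rounded up, so a nonzero delay never
 * becomes zero (an assumption of this sketch). */
unsigned long ns_to_ticks(unsigned long ns, enum unit u)
{
    assert(u == UNIT_NS || u == UNIT_US);
    if (u == UNIT_NS)
        return ns;
    return ns / 1000UL + (ns % 1000UL != 0);
}

/* Convert a delay in microseconds to simulator ticks.  The multiplication
 * can overflow for very large values when ticks are nanoseconds. */
unsigned long us_to_ticks(unsigned long us, enum unit u)
{
    assert(u == UNIT_NS || u == UNIT_US);
    return (u == UNIT_US) ? us : us * 1000UL;
}
```

A switch model written in terms of nanosecond link delays could then call
ns_to_ticks() everywhere and work unchanged with either timing module.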

In addition to conversion routines, the timing modules provide a number of
other routines for initializing, setting, starting, stopping, and reading a
task's clock. Care must be taken in manipulating these clocks to ensure that
time does not flow backwards! Should this occur, the simulator will be
compromised, and unpredictable behavior results.


7.6. Switch Model Interface

Now that the events have been defined, the interface provided to the switch
model can be explained. From the discussion above, it is clear that only three
events (the send, arrive and initsw events) pertain to the switch model.
Minimally, the switch model must provide three routines:

(1) initsw() (swmod.c).

(2) sendev() (swmod.c).

(3) switchev() (swmod.c).

The initsw() (swmod.c) routine is called once at the beginning of the
simulation run. The call indicates such parameters as the maximum number of
tasks that will ever be created, maximum number of import and export fifos, etc.
It is responsible for initializing any data structures used by the switch model.
Also, it is here that the switch can input any information about the
communication network, e.g. the speed of communication links, topology of the
network, etc. An input file pointer is passed to the routine. This pointer
points to a file descriptor for the file specified in the command line which
invoked Simon if the -s option was used. Otherwise, the standard input is used.
The -s flag should always be used to input switch model information, since
application programs often use the standard input to input their own data.

The sendev() (swmod.c) routine is called to process a send event which was
scheduled by a task executing either the put() (lib.c) or puts() (lib.c) routine.
The routine is passed information indicating the time at which the message was
sent, the task sending the message, a list of tasks which are to receive the
message, a message identifier (to be discussed later), and the size of the
message in bytes. Note that a particular message may have several destinations
since broadcast communications are allowed. Since sending a message corresponds
to adding an entry to an export fifo, sendev() (swmod.c) must remove the
message from the fifo, and schedule an arrive event for each destination. Each
arrive event is timestamped to indicate the time at which the message arrived
at its destination. The difference between this time and the time at which the
send event occurred indicates the time required by the interconnection switch to
transmit the message.
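
The following sketch shows the shape of such a routine for a switch with a
simple delay model. It is not the sendev() in swmod.c: the recording array,
sc_arrive(), link_delay(), and the delay formula (a fixed 100 ticks plus 8
ticks per byte) are all invented so the example is self-contained.

```c
#include <assert.h>

#define MAXARRIVE 32

/* One recorded arrive event (stand-in for an entry in the event queue). */
struct arrive {
    int           dest;
    int           msgid;
    unsigned long time;
};

struct arrive arrivals[MAXARRIVE];
int narrivals;

/* Hypothetical stand-in for the simulator's arrive-event scheduler. */
void sc_arrive(int dest, int msgid, unsigned long time)
{
    assert(narrivals < MAXARRIVE);
    arrivals[narrivals].dest = dest;
    arrivals[narrivals].msgid = msgid;
    arrivals[narrivals].time = time;
    narrivals++;
}

/* Assumed delay model: fixed latency plus a per-byte cost, in clock ticks. */
unsigned long link_delay(int nbytes)
{
    return 100UL + 8UL * (unsigned long)nbytes;
}

/* Schedule one arrive event per destination of a (possibly broadcast) send. */
void sendev_sketch(unsigned long sent, const int *dests, int ndest,
                   int msgid, int nbytes)
{
    int i;

    assert(dests != 0 && ndest > 0 && nbytes >= 0);
    /* Removing the message id from the export fifo would happen here. */
    for (i = 0; i < ndest; i++)
        sc_arrive(dests[i], msgid, sent + link_delay(nbytes));
}
```

For a 10 byte message sent at time 1000 to two tasks, both arrive events carry
the timestamp 1180, so the modeled transmission time is 180 ticks.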

It might be noted that the sendev() (swmod.c) routine need not directly
perform the functions described above. It is the switch model as a whole which
is responsible for doing this. The sendev() (swmod.c) routine could schedule its
own internally defined events which lead to removing the message from the export
fifo and scheduling the associated arrive event(s). In a store-and-forward
network, for example, intermediate events might indicate messages going from hop
to hop through the network.

The switch model is allowed to schedule and process its own internally
defined events. When such an event appears at the head of the event queue, the
switchev() (swmod.c) routine is called. In general, the main program calls
switchev() (swmod.c) whenever it removes an event from the event queue which
it does not recognize (the five events listed earlier are the only events
recognized by the main program).

Finally,  one  other  restriction  is  placed  on  the  switch  model:  messages  sent 
on  the  same  virtual  circuit  from  the  same  task  must  be  scheduled  to  arrive  in 
the  same  order  in  which  they  were  sent.  This  is  because  the  user  interface 
specifies  that  sequentiality  is  preserved  on  each  virtual  circuit.  Violation  of  this 
rule  will  result  in  incorrect  computations  for  application  programs  which 
depend  on  this  feature. 
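
One way a switch model could honor this restriction is to remember the last
arrival time scheduled on each virtual circuit and never schedule an earlier
one. The sketch below is only a suggestion under stated assumptions: circuits
are identified by a small integer, NCIRCUITS and arrival_time() are made-up
names, and events with equal timestamps are assumed to be processed in the
order in which they were scheduled.

```c
#include <assert.h>

#define NCIRCUITS 16

/* Last arrival time scheduled on each circuit (all zero at start). */
unsigned long last_arrival[NCIRCUITS];

/*
 * Compute the arrival time for a message sent at time `sent` whose
 * modeled transit time is `delay`, clamped so that it does not precede
 * the previous message on the same circuit.
 */
unsigned long arrival_time(int circuit, unsigned long sent, unsigned long delay)
{
    unsigned long t;

    assert(circuit >= 0 && circuit < NCIRCUITS);
    t = sent + delay;
    if (t < last_arrival[circuit])
        t = last_arrival[circuit];      /* never overtake an earlier message */
    last_arrival[circuit] = t;
    return t;
}
```

A short message sent just after a long one is thus held back until the long
one has arrived, while traffic on other circuits is unaffected.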

7.7. Notes on Porting Simon: Machine Dependencies

Although work is under way to port Simon to other machines, at present no
working version is known to exist on any machine other than a VAX. There are
(at least) two portions of Simon which must be modified if it is to be ported to
another machine. These are the simcc compiler, and the task swapping
mechanisms.

Since the ccsf program scans through assembly language programs and
assigns execution times to blocks of VAX instructions, it is clearly VAX specific.
A new ccsf program is required for each host computer Simon is run on. The
function performed by ccsf remains the same; however, different instruction
opcodes will have to be recognized, and different instruction sequences will have
to be inserted to ensure that the instructions can be inserted without disturbing
the original code. All of this assumes that the compiler on the new host
generates an assembly language program as an intermediate step. Without this,
incorporating a timing model which can be used easily by Simon is much more
difficult.

As  discussed  earlier,  the  code  in  task.c  for  saving  and  restoring  the  runtime 
stack  of  application  programs  depends  on  the  format  of  the  stack  frame.  This 
depends  on  the  compiler  and  host  machine.  If  Simon  is  ported  to  another 
machine,  these  routines  will  have  to  be  modified  to  use  whatever  format  that 
machine  uses. 

Finally, a third point of difficulty, the timing model, may arise if an
attempt is made to move Simon to a machine which does not support 32 bit integer
arithmetic. Without 32 bit arithmetic, the granularity of each time unit may
become too coarse. For example, if 16 bit arithmetic is used, clocks can only
reach a maximum value of 65,000. With the low-precision microsecond time unit,
clocks may only reach values of 65 milliseconds. Coarser time units are
required to reach higher values, reducing the precision of the simulator even
further.

APPENDIX 1


This  appendix  outlines  the  routines  defined  within  the  simulator  which  are 
available  for  use  by  the  programmer. 

char *cnvarr (base, i)
char *base; int i;

Convert character string p to an array reference using base name "base" and
index value "i".

char *cnvarrsv (base, i)
char *base; int i;

Same as cnvarr() except storage for the result is allocated via calloc().

exparr (base, n)
char *base; int n;

Create an array of "n" export fifos with base name "base".

export (name)
char *name;

Create a single export fifo with name "name".

get (name, datap)
char *name, *datap;

Get next message from fifo "name", and store its data where "datap" points.

int getclk (id)
int id;

Returns time on clock (microseconds) for task whose id is "id".

imparr (base, n, maxi)
char *base; int n, maxi;

Create an array of "n" import fifos with base name "base", each holding up to
"maxi" messages (infinite if "maxi" is 0).

import (name, maxi)
char *name; int maxi;

Create a single import fifo with name "name", capable of holding up to "maxi"
messages (infinite if "maxi" is 0).

init (glob, lngth)
char *glob; int lngth;

Use when multiple tasks are created from a single program using global or
static variables which occupy "lngth" bytes starting at location "glob".

mktask (name, cdptr, id, parm, lngth)
char *name, *parm; int (*cdptr)(), id, lngth;

Create a task called "name", and assign it to task id "id". The block of memory
"lngth" bytes long starting at "parm" is passed to the procedure "cdptr" when
the task starts executing.

int overfl (name)
char *name;

Return and reset overflow flag for import fifo "name".

put (name, datap, lngth)
char *name, *datap; int lngth;

Put "lngth" bytes of data starting at location "datap" into export fifo "name".

puts (name, datap, lngth)
char *name, *datap; int lngth;

Same as put(), but send message to self as well.

int qlength (name)
char *name;

Return number of messages currently in import fifo "name".

int size (name)
char *name;

Return size of first message in import fifo "name", or 0 if fifo is empty.

stop ()

Terminate execution of task.

int task ()

Returns the calling task's id.

char *taskname (id)
int id;

Returns a pointer to the name of the task whose id is "id".

waitf (names, n)
char **names; int n;

Wait for data to arrive in one of the "n" import fifos specified by "names".

This appendix describes the mechanics of using the simulator. All object
files may be found in the directory (on ESVAX and CSVAX) ~fujimoto/S. Example
user programs are in the directory ~fujimoto/SIMEX. All test programs should
be compiled with simcc (~fujimoto/BIN/simcc) rather than cc.

The "S" directory contains a set of object files. To use the simulator, link
these object files to your test programs. The simulator's object files and the
functions they perform are listed below.


object file    function

evhand.o       event handler routines (called by mainsim)
evnt.o         event queue primitives
fifo.o         fifo management routines
lib.o          user callable routines
mainsim.o      main program (initialization and main loop)
msg.o          message management routines
task.o         task management (saving/restoring state, etc.)
time.o         task timing model
util.o         miscellaneous utility routines
swmod.o        null switch model (zero latency transmissions)
ether.o        ethernet switch model

Note  that  the  last  two  object  files  are  switch  models.  Only  one  of  these 
should  be  loaded  on  each  simulation  run. 


This appendix describes the parameters which can be specified in the
command line when invoking the simulator. The parameters the simulator will
accept are listed below.

flag    type of parameter    default    meaning

-nt     integer              200        max nmbr of tasks
-nm     integer              2000       max nmbr of msgs at one time
-nn     integer              1000       max nmbr of different fifo names
-ni     integer              1000       max nmbr of import fifo's
-ne     integer              1000       max nmbr of export fifo's
-t      <none>               off        generate traffic statistics
-o      file                 stdout     output file
-s      file                 stdin      input file to switch model


For example, to increase the maximum number of tasks to 400 while diverting
output to the file foo, enter:

% sim -o foo -nt 400
This appendix describes how to use the modified version of the C compiler.
The simulator's timing model requires that all user programs be compiled with
this compiler, called simcc (~fujimoto/BIN/simcc). In addition to all of the
options provided in cc, simcc provides a -T option for specifying the execution
times of VAX assembler instructions. When specified, the -T is followed by the
name of a file which lists these execution times in the format described below.

Thus, if your execution times are in the file "time", you would say:

% simcc -T time ...

to compile your program, where "..." is other options and names of files being
compiled. Note that the blank after "-T" must be present. If the "-T" option is
not specified, the default execution times will be used (all instructions execute
in one microsecond).

The file "time" consists of a list of pairs, with the elements of each pair
separated by blank(s) and/or tab(s), and successive pairs separated by blank(s),
tab(s), and/or newline(s) (both elements of a pair must be on the same line). The
first element of a pair is the mnemonic for a VAX assembler instruction. The
second element is an integer specifying the execution time of that instruction in
NANOSECONDS. An example "time" file is:

pushl  200 
calls  200 
addl3  200 
muld3  200 
movd  200 

which specifies that each of the five instructions above executes in 200
nanoseconds. Note that mnemonics should all be in lowercase letters, and the
execution time is an integer with NO decimal point. All instructions besides
these five will be assigned the default execution time (which is still set at
1000 nanoseconds, or 1 microsecond).
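
The file format described above can be illustrated with a small reader. This is only a sketch in Python, not part of simcc: the function names parse_time_file and execution_time_ns are hypothetical, and the validation rules simply restate the constraints given in the text (lowercase mnemonics, integer times, both elements of a pair on one line, 1000 ns default).

```python
DEFAULT_TIME_NS = 1000  # default stated in the text: 1 microsecond


def parse_time_file(text):
    """Return a dict mapping instruction mnemonic -> execution time in ns."""
    times = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        tokens = line.split()  # splits on any run of blanks and tabs
        if len(tokens) % 2 != 0:
            raise ValueError(f"line {lineno}: both elements of a pair must be on one line")
        # Several pairs may share a line, so walk the tokens two at a time.
        for mnemonic, value in zip(tokens[0::2], tokens[1::2]):
            if mnemonic != mnemonic.lower():
                raise ValueError(f"line {lineno}: mnemonic {mnemonic!r} must be lowercase")
            if not (value.isascii() and value.isdigit()):
                raise ValueError(f"line {lineno}: time {value!r} must be an integer")
            times[mnemonic] = int(value)
    return times


def execution_time_ns(times, mnemonic):
    """Look up an instruction, falling back to the default execution time."""
    return times.get(mnemonic, DEFAULT_TIME_NS)


example = "pushl 200\ncalls 200\naddl3 200\nmuld3 200\nmovd 200\n"
times = parse_time_file(example)
print(execution_time_ns(times, "muld3"))  # listed instruction
print(execution_time_ns(times, "movl"))   # unlisted instruction, default applies
```

Any instruction missing from the file falls through to the default, which mirrors the behavior described for the five-instruction example.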
SELF-CHECKING VLSI BUILDING BLOCKS FOR FAULT-TOLERANT MULTICOMPUTERS


Yuval Tamir and Carlo H. Séquin

Computer Science Division
University of California, Berkeley, CA 94720


Abstract

The  use  of  self-checking  nodes  and  links  for 
implementing  fault-tolerant  VLSI  multicomputers  is 
proposed. The system is composed of a large number of
VLSI computers interconnected by high-speed dedicated
links. Hardware that performs error detection is
combined with system-level protocols that handle error
recovery and fault treatment.

The  self-checking  nodes  notify  the  rest  of  the 
system  when  their  output  is  erroneous.  In  order  to 
achieve  high  fault  coverage,  error  detection  is 
accomplished  by  duplication  and  matching.  The  critical 
circuit  in  this  scheme  is  a  comparator  which  must  not  be 
susceptible  to  faults  that  can  remain  undetected  and 
later  mask  the  failure  of  the  functional  modules.  With 
both NMOS and CMOS technologies it is possible to
implement a self-testing comparator that will produce
an error indication if it incurs any single physical defect.


Introduction

High reliability and high performance are primary
goals of most computer systems. There are fundamental
limits to the increases in reliability and performance
that  are  achievable  by  improvements  in  technology 
alone.  The  limits  on  performance  can  be  overcome  by 
exploiting  parallelism  while  the  limits  on  reliability  can 
be  overcome  by  using  fault  tolerance  techniques. 

Parallelism can be exploited by a system that
consists of a large number of computation nodes, each
able to execute a subtask of the problem being solved. A
possible architecture for such a system, which is
compatible with the constraints of VLSI, is to
interconnect these computation nodes by high-speed
dedicated links and communication nodes that provide
hardware support for communication functions such as
message routing. Each computation node is connected
to one of the communication nodes. A communication
node has several ports through which it is connected to
computation nodes and other communication nodes. We
call such a system a multicomputer. The nodes and
links are components (building blocks) that can be used
to construct multicomputers with a wide range of sizes
and topologies.

System failure occurs when the multicomputer
doesn't perform according to its specifications at its
interface with the "outside world." System failure is
often the result of a failure of one of its components.

Fault tolerance techniques attempt to prevent
component failure from leading to system failure.1 A
multicomputer is especially well suited for reliability
enhancement using fault tolerance techniques since it is
partitioned into independent and "intelligent"
components (the computation and communication
nodes). Fault-free components can adapt to changes in
faulty components and continue their operation in a way
that leads to correct system output despite the fault.

A  brief  overview  of  techniques  for  implementing 
fault  tolerance  in  a  multicomputer  is  presented  and  the 
considerations  that  lead  to  the  choice  of  an  approach 
based  on  self-checking  components  are  discussed. 
Duplication  and  matching  is  shown  to  be  an  effective 
practical  technique  for  implementing  nodes  that  are 
self-checking  with  respect  to  any  likely  fault. 

Implementation of Highly Reliable Multicomputers

The  reliability  of  any  system  can  be  enhanced  by 
increasing  the  reliability  of  its  components  through  fault 
prevention1  techniques  such  as  specialized  design 
methodologies,  stringent  quality  control,  and  extensive 
validation  and  testing.  These  techniques  typically  result 
in  more  complex  designs,  greater  cost,  and  lower 
performance.  Furthermore,  the  effectiveness  of  these 
techniques  is  limited  by  our  inability  to  exhaustively  test 
complex VLSI chips.

Alternatively,  the  reliability  of  the  components  can 
be  increased  by  employing  fault  tolerance  techniques. 
These techniques attempt to ensure that each
component will continue to perform according to its
specifications despite faults. Unfortunately, no
component can tolerate an unbounded number of faults.
The contamination of the system by incorrect output
from a faulty component can be prevented only if, at
some stage, other system components find out about the
failure  of  the  component  and  physically  or  logically 
isolate  it  from  the  rest  of  the  system. 

At  the  system  level,  software  (protocols)  can  be 
used  to  detect  and  recover  from  the  failure  of 
components.  For  example,  identical  tasks  may  be 
assigned  to  three  nodes  and  a  "majority  vote"  taken  on 
the results. One of the problems with this approach is
that if the results conflict, it may be very costly or
impossible to locate the cause of the discrepancy.
Additional problems are the high overhead in
computation resources and communication bandwidth
and difficulties in effectively handling transient faults.

If  a  node  fails  due  to  a  transient  fault,  it  should  be 
reset  to  a  "sane  state"  and  remain  active  rather  than  be 
removed  from  the  system.  If  neighboring  nodes  are 
responsible  for  detecting  such  a  failure,  they  must  be 
given the authority to initiate the reset. However, this
authority also allows a failed node to reset operational
neighbors. In order to prevent this situation, each node
must be responsible for its own reset. Hence, the node
should include a mechanism to detect its own erroneous
state1 and initiate the reset.

Some of the deficiencies with the aforementioned
techniques can be overcome by implementing fault
tolerance in a VLSI multicomputer using hardware error
detection in conjunction with system level protocols
which perform error recovery and fault treatment.1
Errors  caused  by  faults  in  the  communication  links  are 
detected  through  the  use  of  error-detecting  codes.  All 
nodes  are  self-checking  and  signal  to  the  rest  of  the 
system  when  their  output  is  incorrect  so  that  it  will  not 
be  accepted  as  correct.  In  addition,  failed  nodes 
attempt  to  reset  themselves  and  reestablish  a  sane 
state.  The  immediate  neighbors  are  informed  whenever 
a  node  fails.  If  the  node  doesn't  reset  itself  or  fails  too 
often,  the  neighbors  can  logically  remove  it  from  the 
system  by  refusing  to  communicate  with  it.  The 
diagnostic  status  information  is  distributed  throughout 
the  system  so  that,  eventually,  no  fault-free  node  will 
attempt  to  use  the  faulty  component. 


For  all  likely  faults,  a  self-checking  component  must 
either  produce  the  "correct"  outputs  or  somehow 
indicate  that  the  outputs  are  incorrect.  A  component 
that  satisfies  this  requirement  is  said  to  be  fault 
secure.10 If the component is not guaranteed to produce
an error indication immediately following the first fault,
it is possible for several faults to exist in the component
simultaneously without any indication to the rest of the
system. Even if the component is fault secure with
respect to any single fault, several faults together may
lead to the failure of the self-check mechanism and,
eventually, to incorrect outputs from the component
being accepted as correct by the rest of the system. In
order to prevent this situation, the component must be
self-testing.10 In the presence of one or more faults, a
self-testing component produces an error indication
before additional faults can occur and lead to the failure
of the self-check mechanism. Components which are
fault-secure and self-testing are said to be totally
self-checking10 (TSC).

Error  detecting/correcting  codes  can  be  used  to 
implement  TSC  nodes.  Redundant  information  is  carried 
by  busses,  memories,  and  registers  in  order  to  detect 
(and possibly correct) errors.10 Unfortunately, different
coding schemes must be used for different parts of the
node. The resulting increase in the complexity of the
design and of design verification and testing may lead to
a  circuit  in  which  failure  modes  that  are  more  difficult  to 
predict  and  "tolerate"  are  more  likely  to  occur. 

An  alternative  is  to  construct  the  TSC  computation 
or  communication  node  using  two  identical,  independent 
modules,  each  performing  the  function  of  the  node. 
Inputs  from  neighbor  nodes  are  fed  to  both  modules.  If 
the modules operate synchronously, their outputs should
always be identical. Except for the nearly-impossible
case where both modules produce identical incorrect
output, an error can be detected by a comparator which
is part of the node. The output of the comparator is
connected to neighboring nodes through dedicated
wires. The output from one of the two modules is the
"functional" output from the node (Fig. 1). A "no-match"
signal from the comparator is used locally as a
reset signal and is also sent to all neighbors as a failure
indicator. Similar failure indicators from the neighbors
cause  an  interrupt  and  invoke  system-level  routines  that 
handle  the  node  failure. 
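
The duplication and matching arrangement described above can be sketched behaviorally. The following Python fragment is an illustration only, under stated assumptions: the class SelfCheckingNode, the adder modules, and the injected stuck-at-1 output bit are all hypothetical and do not come from the paper. It shows two identical modules fed the same input, a comparator that raises a "no-match" signal, and the use of one module's output as the "functional" output.

```python
class SelfCheckingNode:
    """Two copies of a module plus a comparator (behavioral model only)."""

    def __init__(self, module_a, module_b):
        self.module_a = module_a
        self.module_b = module_b
        self.resets = 0

    def step(self, inputs):
        out_a = self.module_a(inputs)   # "functional" output of the node
        out_b = self.module_b(inputs)   # duplicate, used only for matching
        no_match = out_a != out_b       # comparator result
        if no_match:
            self.resets += 1            # no-match is used locally as a reset
        return out_a, no_match          # no_match also goes to the neighbors


def adder(inputs):
    a, b = inputs
    return (a + b) & 0xFFFF


def faulty_adder(inputs):
    # Hypothetical fault: bit 0 of the module output is stuck-at-1.
    return adder(inputs) | 0x0001


node = SelfCheckingNode(adder, faulty_adder)
print(node.step((2, 3)))  # fault not exercised: both modules agree
print(node.step((2, 2)))  # outputs differ: failure indicator is raised
```

The first call shows that a fault stays latent until some input makes the two modules disagree; detection happens the first time their outputs differ.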

Implementing the TSC property in a component
using duplication and matching may appear wasteful
since it more than doubles the required hardware.
However, this scheme becomes more attractive when
issues such as design complexity, fault coverage,
reliability prediction, and the ability to recover from
transient faults are taken into account. Traditional fault
models are not valid for VLSI.12 As a result, low-cost
error detection schemes, which are based on these
models, may no longer be adequate. With duplication
and matching, errors are detected as long as the
comparator remains functional and the two modules
produce different outputs the first time one or both of
them fail. Since a faulty comparator can mask faulty
functional modules, faults in the comparator must not go
undetected, i.e., the comparator must be self-testing.
Thus a detailed analysis of the effects of all likely faults
on the comparator is required.

Fig. 1: A Self-Checking Computation Node (signals shown: "Functional"
Output, Failure Indicator, Neighbor's Status, "Functional" Input)


The  design  of  self-checking  circuits  requires  an 
understanding  of  the  physical  defects  which  commonly 
occur in VLSI and of the resulting logical faults. In the
past the stuck-at fault model has been widely used to
model, at the logical level, the effects of physical defects
in circuits. This model does not cover many of the
possible defects in VLSI.12 The fabrication flaws and
physical  processes  that  can  cause  malfunction  of  NMOS 
and  CMOS  VLSI  circuits  are  summarized  in  this  section. 

VLSI chip failures may be caused by design or
fabrication flaws, may be due entirely to environmental
factors, or may be the end result of a degenerative process
due to operational and environmental stresses but
partially attributable to design or manufacturing
defects.10 Fabrication defects in MOS chips consist
mainly of shorts and opens in each interconnection level,
shorts between different levels, and large imperfections
such as scratches across the chip. Other fabrication
defects include incorrect dosage of ion implants, contact
windows that fail to open, misplaced or defective bonds,
and penetration of the package by contaminants.4
During the operation of the chip, faults may be caused
by electromigration, corrosion, electrical breakdown of
oxide, cracks due to thermal expansion, power supply
fluctuation, and ionizing or electromagnetic radiation.

At the logical level, most of the faults can be
represented in a circuit model that consists of a network
of switches, loads (for NMOS), and interconnection lines
which directly correspond to the transistors and
interconnections in the actual circuit. Most of the
physical defects, such as opens and shorts, can be
represented in this model in an obvious way. A "switch"
may be permanently on or permanently off,
corresponding to a gate input stuck-at-1 or stuck-at-0,
respectively. Shorted NMOS loads (pullups) are
equivalent to an output line s-a-1. Disconnected gate
inputs are usually equivalent to s-a-0 or s-a-1 faults.
Some physical defects have a more complex effect
on the circuit. In NMOS, incorrect dosage of ion implants
may cause a threshold shift in a load transistor. This
can result in an output voltage that lies between the
voltages assigned to logic 0 and logic 1. If the fanout
from the gate is greater than one, some of the gates
connected to its output may "interpret" it as logic 1
while others will interpret it as logic 0. If, at some point
in time (clock cycle), the line is supposed to be a logic 1
but is interpreted by some of the gates as logic 0, we call
it a weak 1 fault. Conversely, if the line is supposed to
be a logic 0 but is interpreted by some of the gates as
logic 1, we call it a weak 0 fault. A single physical defect,
resulting in a single weak 0 or weak 1 fault, has the same
effect as multiple s-a-1 or s-a-0 faults, respectively.
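
The equivalence between a weak 1 fault and multiple s-a-0 faults can be made concrete with a toy model. The sketch below is an assumption-laden illustration in Python: the gate names g1, g2, g3 and both helper functions are hypothetical, and the model captures only the logical definition given above (which fanout gates misread the line), not the underlying voltages.

```python
def seen_with_weak1(true_value, fanout, misreading):
    """Value each fanout gate sees when the driving line has a weak 1 fault.

    Gates in `misreading` interpret an intended logic 1 as logic 0;
    an intended logic 0 is read correctly by every gate.
    """
    return {g: (0 if (true_value == 1 and g in misreading) else true_value)
            for g in fanout}


def seen_with_branch_sa0(true_value, fanout, stuck):
    """Value each fanout gate sees when the branches in `stuck` are s-a-0."""
    return {g: (0 if g in stuck else true_value) for g in fanout}


fanout = ["g1", "g2", "g3"]   # hypothetical gates driven by the faulty line
affected = {"g1", "g3"}       # gates that misread the degraded level

for value in (0, 1):
    weak = seen_with_weak1(value, fanout, affected)
    stuck = seen_with_branch_sa0(value, fanout, affected)
    print(value, weak, weak == stuck)
```

In this model one defect on the driving line behaves exactly like separate s-a-0 faults on two fanout branches, which is why a single-stuck-at analysis does not cover it.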

In CMOS, a transistor which is permanently off or a
break in a line can result in a high impedance state
where the output of a combinational logic gate is
dependent on the previous output rather than the
current input.12 Such a fault (called a stuck-open fault)
may  escape  detection  even  if  all  possible  input  vectors 
are  used  to  test  the  circuit.12 
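
A small behavioral model shows why a stuck-open fault can slip past a test that applies every input vector. This Python sketch is hypothetical: it assumes a CMOS inverter whose p-channel pull-up is permanently off and an output node that holds its charge indefinitely (no leakage); the names StuckOpenInverter and detects are introduced only for illustration.

```python
class StuckOpenInverter:
    """CMOS inverter whose pull-up transistor is permanently off."""

    def __init__(self, initial_output):
        self.output = initial_output  # charge held on the output node

    def apply(self, x):
        if x == 1:
            self.output = 0           # pull-down path still works
        # x == 0: the pull-up path is open, so the output floats
        # and keeps whatever value it had before.
        return self.output


def detects(initial_output, sequence):
    """True if some vector in the sequence produces a wrong output."""
    gate = StuckOpenInverter(initial_output)
    return any(gate.apply(x) != 1 - x for x in sequence)


# Both sequences contain every possible input vector (0 and 1).
print(detects(1, [0, 1]))  # the floating node happens to hold the right value
print(detects(1, [1, 0]))  # the retained 0 exposes the fault
```

The gate has become a sequential element, so detection depends on the order of the vectors and on the initial state, not just on which vectors are applied.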


The duplication and matching scheme relies entirely
on a self-testing comparator to detect faults in the
functional modules. Implementing such a comparator
requires knowledge of how different faults will affect the
circuit. Fortunately, a comparator is a simple circuit
that can be implemented with a regular structure and is
therefore amenable to thorough analysis. Hence, we can
have confidence in our ability to predict the likely
physical defects, develop a valid fault model, and prove
that the implementation we propose is indeed
self-testing.

We assume that physical defects in the node occur
one at a time. A fault that is the result of a single
physical defect is called a single fault. It is assumed
that there is a negligible probability that the time
interval between the occurrence of successive single
defects in the comparator, or between a single defect in
the comparator and an arbitrary collection of defects in
the functional modules, is less than some value T. In
order to ensure that faults in the comparator will not
mask future faults in the functional units, during normal
operation, the comparator must "test itself" for any
single fault in less than time T.


As a first step to constructing a comparator which is
self-testing with respect to any single fault, we will
discuss the implementation of a comparator which is
self-testing with respect to any single stuck-at fault.

In this context "two-rail" codes prove useful. They
consist of all words (bit vectors) such that a specified
half of the word is the complement of the other half. If
the output of one of the modules in a self-checking node
is complemented, a two-rail code checker can serve as a
"comparator" that checks the validity of the output.

Such a checker, which is self-testing with respect to any
single stuck-at fault, can be implemented as a two level
NOR-NOR PLA (Fig. 2).13-2 The output from the checker is
a two-rail code that is 01 or 10 (code output) if
the input is a two-rail code word (code input), and 00 or
11 (noncode output) otherwise (noncode input). It can
be shown that if any single stuck-at fault exists in the
checker, there is an input two-rail code word that results
in a 00 or 11 output, thereby "detecting" the fault.13
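
The input/output behavior of such a checker can be sketched at the logic level. The Python model below is an assumption, not the PLA of Fig. 2: it uses the classic AND-OR formulation with one product term per code word (a NOR-NOR PLA realizes logic of this kind with different line polarities), and the names two_rail_checker, is_code, and stuck_at_0_term are hypothetical. Here x is the output of one module and y the complemented output of the other.

```python
from itertools import product


def two_rail_checker(x, y, stuck_at_0_term=None):
    """Two-level two-rail checker; x and y are equal-length tuples of bits.

    For every choice vector c there is one product term that ANDs x[i]
    where c[i] == 1 and y[i] where c[i] == 0. Terms using an even number
    of y lines drive output f, the others drive output g, so each product
    term is connected to exactly one output.
    """
    n = len(x)
    f = g = 0
    for choice in product((0, 1), repeat=n):
        if choice == stuck_at_0_term:
            continue  # injected fault: this product term is never selected
        selected = all(x[i] if choice[i] else y[i] for i in range(n))
        if selected:
            if choice.count(0) % 2 == 0:
                f = 1
            else:
                g = 1
    return f, g


def is_code(pair):
    return pair in ((0, 1), (1, 0))


# Exhaustive check for n = 3: code output if and only if code input.
n = 3
for x in product((0, 1), repeat=n):
    for y in product((0, 1), repeat=n):
        input_is_code = all(a != b for a, b in zip(x, y))
        assert is_code(two_rail_checker(x, y)) == input_is_code

x = (1, 0, 1)
y = (0, 1, 0)
healthy = two_rail_checker(x, y)
faulty = two_rail_checker(x, y, stuck_at_0_term=x)
print(healthy, faulty)
```

With the product term for code word x forced to 0, that same code word yields a noncode output, which is the sense in which a code input "detects" the fault.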


Fig. 2: An NMOS Self-Testing Two-Rail Code Checker

The requirement that the checker must be self-testing
with respect to any single stuck-at fault poses
severe constraints on its implementation. It can be
shown that any two level AND-OR (or NOR-NOR)
implementation for an input of 2n bits (n bits from each
module) must use 2^n product terms, one for each code
input.11 If the output from each module is, say, 16 bits,
this implementation is impractical since it requires
2^16 = 65536 product terms. Furthermore, all possible
(2^n) code words must appear at the checker's inputs for
it to perform a complete self-test.

Several small self-testing two-rail code checkers can
be used as "cells" for constructing a self-testing checker
for a wide input word (Fig. 3).10 While the self-testing
property is preserved, the number of input patterns
required for a complete self-test is dependent only on
the size of the largest "cell."
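
The saving offered by a tree of small cells can be illustrated with a behavioral sketch. The Python code below rests on assumptions not taken from the paper: the cell equations are the classic textbook two-pair two-rail cell (four product terms), the tree shape is a simple pairwise reduction, and the function names are hypothetical. It compares product-term counts and checks exhaustively, for four input pairs, that the tree output is a code word exactly when every input pair is.

```python
from itertools import product


def cell(p, q):
    """Two-pair two-rail checker cell with four product terms."""
    (x0, y0), (x1, y1) = p, q
    f = (x0 & y1) | (y0 & x1)
    g = (x0 & x1) | (y0 & y1)
    return f, g


def checker_tree(pairs):
    """Reduce a list of (x, y) pairs to one output pair using cells."""
    level = list(pairs)
    while len(level) > 1:
        nxt = [cell(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired input passes to the next level
        level = nxt
    return level[0]


def product_terms_single_pla(n):
    return 2 ** n          # one product term per code word


def product_terms_tree(n):
    return 4 * (n - 1)     # n - 1 cells, four product terms each


# Exhaustive check for 4 pairs (8 input lines).
for bits in product((0, 1), repeat=8):
    pairs = [(bits[2 * i], bits[2 * i + 1]) for i in range(4)]
    all_code = all(x != y for x, y in pairs)
    f, g = checker_tree(pairs)
    assert (f != g) == all_code

print(product_terms_single_pla(16), product_terms_tree(16))
```

For 16 bits per module the single PLA needs 65536 product terms, while this particular tree needs 60, and each cell can be fully exercised by its own four code inputs.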


Fig. 3: A Self-Testing Two-Rail Code Checker Tree


The  faults  that  commonly  occur  in  a  MOS  PLA  are 
stuck-at  faults,  shorts  between  adjacent  lines,  breaks  in 
lines,  and  contact  faults  which  include  missing  or  extra 
devices  at  crosspoints.13-7  In  addition,  weak  0/1  faults 
can  occur  on  the  input  or  product  term  lines. 
Fortunately, it turns out that a straightforward NOR-NOR
PLA implementation of the checker discussed above is
self-testing with respect to any one of the
aforementioned single faults. The rest of this section
contains an informal "proof" of this claim; a more
formal proof will be presented elsewhere.11 Faults in the
input lines, product term lines, output lines, AND array
crosspoints, and OR array crosspoints are considered
separately.

Any single stuck-at fault or short in the input lines
will cause one or more 0's to change to 1's or one or
more 1's to change to 0's (but not both) for some code
input. It can be shown that such an error (called a
unidirectional error7) on the input lines results in
noncode output.13 A break in the input line outside the
AND array is equivalent to the line stuck-at-0 or
stuck-at-1. A break in the middle of the AND array affects only
some product terms. For an affected product term, if
the break is equivalent to a stuck-at-1, the one code
input that is supposed to select this product term won't,
and a noncode output will result. If the break is
equivalent to a stuck-at-0, there exists a code input that
results in a noncode output since it selects two product
term lines each of which is connected to a different
output line.11

An  extra  device  in  the  AND  array  is  equivalent  to  the 
corresponding  product  term  stuck-at-0.  The  code  input 
that  is  supposed  to  select  that  product  term  results  in  a 
noncode  output.  If  there  is  a  missing  device  in  the  AND 
array,  there  exists  a  code  input  that  produces  a  noncode 
output  since  it  selects  two  product  term  lines,  each  of 
which  is  connected  to  a  different  output  line.11 

An  extra  device  in  the  OR  array  means  that  one  of 
the  product  terms  is  connected  to  both  outputs.  A 
missing  device  in  the  OR  array  is  equivalent  to  the 
corresponding product term stuck-at-0. In either case,
the  code  input  that  selects  the  relevant  product  term 
will  result  in  a  noncode  output. 

If  the  output  lines  are  shorted,  their  values  are 
equal  and  that  is  a  noncode  output.  If  one  of  the  lines 
has  a  stuck-at  fault,  there  exists  a  code  input  that 
causes  the  other  line  to  have  the  same  value  so  the 
output  is  noncode.  For  some  code  input,  a  break  in  one 
of  the  output  lines  is  equivalent  to  a  stuck-at-1  or  stuck- 
at-0  fault  on  that  line. 

A  stuck-at-0  fault  on  a  product  term  line  will  result 
in  a  noncode  output  if  the  input  is  the  code  word  that  is 
supposed  to  select  that  product  term  line.  A  stuck-at-1 
fault  on  a  product  term  line  will  result  in  a  noncode 
output  to  any  input  that  selects  a  product  term  line  that 
is  connected  to  the  other  output  line.  A  break  in  a 
product  term  line  is  equivalent  to  a  stuck-at  fault  on  that 
line  since  each  product  term  line  is  connected  to  only 
one  output  line.  A  short  between  two  product  term  lines 
will  result  in  a  noncode  output  if  the  input  selects  either 
one  of  these  lines.11 

Product  term  lines  are  not  susceptible  to  weak  0/1 
faults  since  each  product  term  line  is  connected  to  only 
one output line (fanout of one) so that a weak 0/1 fault is
equivalent to a single stuck-at fault. Input lines have a
fanout greater than one and are thus susceptible to
weak 0/1 faults. A weak 1 fault on an input line is
equivalent to one or more missing devices in the AND
array. Each product term that is connected to a
"missing device" will be selected by an input code word
that  also  selects  a  product  term  line  that  is  connected  to 
the  other  output  line.11  Thus,  a  noncode  output  will 
result.  A  weak  0  fault  on  an  input  line  is  equivalent  to 
one  or  more  product  term  lines  which  are  stuck-at-0. 
Any  code  input  that  is  supposed  to  select  one  of  these 
product  terms  will  result  in  a  noncode  output. 

In CMOS chips PLAs are usually implemented in
dynamic "pseudo NMOS."12 All product term and output
lines  are  precharged  during  every  clock  cycle  before 
being  selectively  discharged  according  to  the  input. 
Therefore  no  state  is  preserved  from  one  cycle  to  the 
next  and  the  circuit  is  combinational  despite  any  opens 
in  the  precharge  or  discharge  paths.11  Hence  the  PLA 
used  in  CMOS  chips  is  only  susceptible  to  the  same  faults 
as  the  traditional  static  PLA  used  in  NMOS  chips. 


This  analysis  shows  that  for  all  single  faults  in  our 
fault  model,  there  exists  a  code  input  that  results  in  a 
noncode  output  from  the  proposed  two-rail  code  checker 
PLA.  Thus,  the  checker  is  self-testing  with  respect  to 
any  likely  single  fault.  Based  on  this  result,  it  can  be 
shown  that  the  checker  constructed  as  a  tree  of  smaller 
self-testing  checkers  (Fig.  3)  is  also  self-testing  with 
respect  to  any  likely  single  fault.11 

Summary and Conclusions

We  have  presented  an  approach  to  increasing  the 
reliability  of  future  high-end  systems  beyond  what  is 
possible  with  technological  solutions  alone.  The  system 
consists  of  computation  nodes  and  communication 
nodes, interconnected by high-speed dedicated links.
These  components  are  relied  upon  to  detect  errors  while 
system  level  protocols  are  used  for  error  recovery  and 
reconfiguration. 

The  use  of  duplication  and  matching  for 
implementing  the  self-checking  nodes  allows  us  to 
restrict  a  detailed  analysis  of  the  impact  of  all  possible 
faults  to  the  comparator,  which  is  a  relatively  simple 
circuit.  We  have  shown  that  the  self-testing  comparator, 
which  is  the  backbone  of  our  approach,  can  be 
implemented  with  NMOS  and  CMOS  technologies. 
Abstract

A  high  probability  of  detecting  errors  caused  by  hardware  faults  is  an  essential 
property  of  any  fault-tolerant  system.  VLSI  technology  makes  the  use  of  duplication 
and  matching  for  error  detection  practical  and  attractive.  A  critical  circuit  in  this 
context  is  a  self-testing  comparator.  Faults  in  the  comparator  must  be  detected  so 
that  they  do  not  mask  discrepancies  between  the  duplicated  modules. 

This paper discusses the implementation of comparators which are self-testing with
respect  to  faults  caused  by  any  single  physical  defect  likely  to  occur  in  NMOS  and  CMOS 
integrated  circuits.  A  new  fault  model  for  PLAs  is  presented.  This  model  reflects  several 
physical  defects  in  VLSI  circuits  which  are  not  accounted  for  in  previously  published 
models.  It  is  shown  that  in  a  self-testing  comparator,  implemented  as  a  single  two-level 
NOR-NOR  PLA,  the  number  of  required  product  terms  grows  exponentially  with  the 
number  of  input  bits.  A  particular  design  of  a  comparator  using  a  single  two-level  NOR- 
NOR  PLA  is  discussed.  The  operation  of  this  comparator  under  the  faults  in  the  fault 
model  is  analyzed  in  detail.  The  comparator  is  proven  to  be  self-testing  with  respect  to 
any  likely  single  fault  in  the  proposed  fault  model,  provided  that  several  guidelines 
about  its  physical  layout  are  followed.  The  use  of  this  comparator  as  a  basic  building 
block  of  fault-tolerant  systems  is  discussed. 
I. INTRODUCTION

Ideally,  a  computer  system  must  always  generate  the  "correct"  results.  Since  real 
systems  suffer  from  design  faults,  fabrication  defects,  and  other  hardware  faults,  this 
ideal  can  never  be  reached.  A  minimum  requirement  from  any  computer  system  is  that  it 
must  not  mislead  the  "outside  world"  into  accepting  incorrect  outputs  as  correct. 
Hence,  the  "outside  world"  must  be  notified  (or  be  able  to  detect)  when  errors  occur. 
Fault-tolerant systems attempt to achieve a higher probability of producing correct
outputs  by  adapting  to  changes  in  the  system  caused  by  faults,  and  continuing  correct 
operation.  Explicit  or  implicit  error  detection  is  not  only  required  for  preventing 
acceptance  of  incorrect  outputs  but  is  also  a  necessary  first  step  of  any  scheme  for 
"recovering"  from  faults  [2]. 

Checker  circuits  play  a  key  role  in  systems  with  on-line  error  detection.  They 
continuously  verify  that  certain  sets  of  lines  in  the  system  carry  values  that  together 
conform  to  some  code.  Some  of  the  typical  codes  used  are:  parity  codes,  m-out-of-n 
codes,  arithmetic  codes,  and  two-rail  codes  [28].  When  such  codes  are  used,  it  is  assumed 
that  faults  will  cause  the  values  on  the  monitored  lines  to  change  in  such  a  way  that  they 
will  no  longer  conform  to  the  code.  If  faults  modify  the  values  in  unexpected  ways  so 
that  the  incorrect  values  still  conform  to  the  code,  the  error  cannot  be  detected. 

In modern VLSI chips the traditional fault model, which takes into account only single
stuck-at faults, is no longer valid [13, 27]. For example, a single physical defect may
result in erroneous values on several lines or may convert a combinational circuit into a
finite state machine. Hence, simple error-detecting codes, such as a single parity bit,
may fail to detect many of the possible errors. Furthermore, evaluating the percentage
of  faults  that  can  be  detected  by  a  given  scheme  (fault-coverage)  is  very  difficult  since 
we  cannot  use  the  traditional  method  of  simulating  the  effects  of  all  possible  faults  [20]. 
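
The weakness of a single parity bit under multi-line errors is easy to demonstrate. The Python fragment below is a minimal sketch with hypothetical names (parity_bit, parity_detects) and made-up data; it assumes even parity over four data bits and compares one flipped line against two flipped lines, as could result from a single physical defect affecting several lines.

```python
def parity_bit(bits):
    """Even-parity bit for a list of data bits."""
    return sum(bits) % 2


def parity_detects(received):
    """True if the received word (data bits + parity bit) fails the check."""
    return sum(received) % 2 != 0


data = [1, 0, 1, 1]
word = data + [parity_bit(data)]

one_line = word.copy()
one_line[0] ^= 1            # a fault corrupts a single line

two_lines = word.copy()
two_lines[0] ^= 1           # one defect corrupts two lines at once
two_lines[1] ^= 1

print(parity_detects(word), parity_detects(one_line), parity_detects(two_lines))
```

The two-line error leaves the parity unchanged, so the corrupted word still conforms to the code and passes the check.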

No  single  type  of  error-detecting  code  is  suitable  for  use  throughout  a  complex  chip 
such  as  a  microprocessor.  While  Hamming  codes  may  be  used  for  detecting  errors  in 
registers  or  bus  transfers  [28],  arithmetic  codes  [3]  are  needed  for  checking  the 
operation  of  the  ALU,  and  duplication  and  matching  must  be  used  for  checking  control 
lines  and  logical  operations  [8, 26].  The  need  to  use  different  codes  and  implement 
different  checkers  in  different  parts  of  the  chip  exacerbates  the  already  difficult 
problem of managing the complexity of VLSI chip design [21]. The increase in the
complexity of the design and of its verification decreases the confidence that the design
is  correct,  thereby  reducing  the  overall  reliability  of  the  system. 

Modern  VLSI  technology  makes  duplication  and  matching  an  attractive  and 
economically  feasible  way  of  implementing  error  detection.  Duplicate  high-level 
functional  modules,  such  as  microprocessors  or  communication  chips,  can  run  in  parallel, 
and  errors  can  be  detected  by  comparing  module  outputs  [10, 22, 24].  Errors  are 
detected as long as the first time one, or both, of the two modules fail(s), they produce
different  outputs.  There  is  no  need  to  predict  exactly  how  defects  will  affect  the  modules 
and  there  is  no  increase  in  the  complexity  of  designing,  verifying  the  design,  and  testing 
the  chips. 

Duplication and matching may also be used to alleviate the problem of system
failure  due  to  chips  with  fabrication  defects  or  undetected  design  faults.  Separately 
designed  functional  modules  can  be  used  in  order  to  detect  latent  design  and  fabrication 
faults  during  the  operation  of  the  system  and  prevent  undetected  incorrect  outputs  [4]. 
Using  different  designs  also  virtually  eliminates  the  possibility  that  the  two  modules  fail 
in  exactly  the  same  way  at  exactly  the  same  time. 

The  critical  element  in  any  duplication  and  matching  scheme  is  the  circuit  that 
compares  the  outputs  from  the  two  functional  modules.  Undetected  faults  in  this 
comparator  can  mask  discrepancies  between  the  outputs  of  the  functional  modules. 
Hence, the comparator must be self-testing [1], i.e., during normal operation physical
defects  in  the  comparator  must  result  in  an  error  indication. 

This  paper  discusses  the  design  and  implementation  of  self-testing  comparators  in 
VLSI.  It  focuses  on  designs  based  on  programmable  logic  arrays  (PLAs)  [18].  Large  VLSI 
chips  are  far  too  complex  to  allow  detailed  analysis  of  all  the  possible  physical  defects 
that can occur and of the effects of these defects on the operation of the circuit. On the
other hand, PLAs are characterized by a simple regular structure and are therefore more
amenable  to  thorough  analysis.  Based  on  such  analysis,  a  new  fault  model  for  PLAs  is 
developed.  The  model  reflects  some  physical  defects  that  are  likely  to  occur  in 
integrated  circuits  but  are  not  taken  into  account  in  previously  published  models. 

Comparators  implemented  with  two-level  AND-OR  or  NOR-NOR  circuits,  which  are 
claimed  to  be  self-testing,  have  been  presented  in  the  literature  [6, 29].  We  show  that 
these designs, which require that the number of product terms grow exponentially with
the number of input bits, are optimal in terms of size. We present a formal proof that a
comparator  implemented  as  a  NOR-NOR  PLA,  based  on  these  designs,  is  self-testing  with 
respect  to  most  single  faults  in  the  new  fault  model.  We  propose  a  few  simple  layout 
guidelines  that  help  ensure  that  the  comparator  is  self-testing  with  respect  to  the 
remaining  faults. 

Finally,  we  discuss  the  application  of  the  self-testing  comparator  as  a  basic  building 
block  for  implementing  fault-tolerant  systems. 

II. The Fault Model

In  order  to  design  and  implement  self-testing  circuits,  it  is  necessary  to  consider  the 
physical  defects  that  are  likely  to  occur  with  the  specific  technology  being  used. 
However,  the  design  is  done  at  the  level  of  boolean  logic  ("ones  and  zeros")  rather  than 
at  the  level  of  voltages,  currents,  and  charges.  Hence,  once  the  likely  physical  defects 
are  known,  it  is  desirable  to  determine  the  effects  that  these  defects  have  on  the 
operation  of  the  circuit  at  the  logical  level.  A  description  of  these  effects  is  called  a 
fault  model.  In  this  section  we  present  a  fault  model  for  general  NMOS  and  CMOS  VLSI 
circuits  and  use  it  to  develop  a  detailed  fault  model  for  PLAs. 

A.  Faults  in  MOS  VLSI  Digital  Circuits 

The  failure  of  a  VLSI  chip  may  be  due  to  design  or  fabrication  flaws,  environmental 
factors,  or  a  combination  of  the  two  [11, 14].  The  resulting  physical  defects  consist 
mainly  of  breaks  in  lines,  shorts  between  lines  at  the  same  interconnection  level 
(metallization,  diffusion,  and  poly-silicon),  shorts  through  the  insulator  separating 
different  levels,  shorts  to  the  substrate,  and  large  imperfections  such  as  scratches 
across  the  chip  [9, 13].  Other  possible  defects  are  incorrect  dosage  of  ion  implants, 
contact windows that fail to open, and misplaced or defective bonds [11]. During the
operation of the chip, faults may also be caused by power supply fluctuation, and
ionizing or electromagnetic radiation [7, 11].

While the stuck-at fault model [12] can represent the effects of a significant
percentage  of  the  physical  defects  that  occur  in  modern  NMOS  and  CMOS  VLSI  circuits,  it 
cannot  represent  the  effects  of  several  other  possible  defects  and  is  therefore 
insufficient  [9, 13, 27].  The  effects  of  most  defects  can  be  represented,  at  the  logical 
level, by a circuit model that consists of a network of switches, loads (for NMOS), and
interconnection lines which directly correspond to the transistors and interconnections
in  the  actual  circuit  [13].  Shorts  and  breaks  in  lines  can  be  represented  with  this  circuit 
model in an obvious way [9]. Shorts to "ground" and "power" are the traditional stuck-at
faults. A "switch" may be permanently on or permanently off, corresponding to a gate
input  stuck-at-1  or  0,  respectively.  Shorted  NMOS  loads  (pull-ups)  are  equivalent  to  an 
output  line  s-a-1.  Disconnected  gate  inputs  are  usually  equivalent  to  s-a-0  or  s-a-1 
faults.  A  single  break  in  a  line  that  fans  out  to  many  inputs  is  equivalent  to  multiple 
stuck-at  faults  (all  of  the  same  type). 

Some  physical  defects  have  a  more  complex  effect  on  the  circuit.  In  NMOS, 
incorrect  dosage  of  ion  implants  may  cause  a  threshold  shift  in  a  load  transistor.  This 
can  result  in  an  output  voltage  that  lies  between  the  voltages  assigned  to  logic  0  and 
logic 1. If the fanout from the gate is greater than one, some of the attached gates may
"interpret" its output as logic 1 while others will interpret it as logic 0. If, at some point
in time (clock cycle), the line is supposed to be a logic 1 but is interpreted by at least
one of the gates as logic 0, we call it a weak 1 fault. Conversely, if the line is supposed to
be a logic 0 but is interpreted by at least one of the gates as logic 1, we call it a weak 0
fault [24]. It is clear that a line may exhibit both a weak 0 fault and a weak 1 fault, as a
result  of  a  single  physical  defect. 

A  stuck-at-1  fault  is  a  degenerate  case  of  a  weak  0  fault  while  a  stuck-at-0  fault  is  a 
degenerate  case  of  a  weak  1  fault.  If  a  line  is  stuck-at-1,  all  the  devices  connected  to  it 
always  interpret  its  value  as  logic  1.  If  a  line  has  a  weak  0  fault,  at  least  one  of  the 
devices  connected  to  it  always  interprets  it  as  a  logic  1. 
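
The relation between weak faults and stuck-at faults can be made concrete with a small model of one line that fans out to several receivers. This is only an illustrative sketch: the function name, the receiver numbering and the fanout of three are assumptions, not part of the fault model itself.

```python
def received_values(driven, misreads_0_as_1=(), misreads_1_as_0=(), fanout=3):
    """Return the logic value seen by each receiver attached to one line.

    driven          -- the value the line is supposed to carry (0 or 1)
    misreads_0_as_1 -- receivers affected by a weak 0 fault
    misreads_1_as_0 -- receivers affected by a weak 1 fault
    """
    seen = []
    for receiver in range(fanout):
        if driven == 0 and receiver in misreads_0_as_1:
            seen.append(1)
        elif driven == 1 and receiver in misreads_1_as_0:
            seen.append(0)
        else:
            seen.append(driven)
    return seen


# Weak 0 fault: only receiver 1 misinterprets a logic 0.
print(received_values(0, misreads_0_as_1={1}))
# Stuck-at-1 as the degenerate case: every receiver misinterprets a logic 0.
print(received_values(0, misreads_0_as_1={0, 1, 2}))
# One defect causing both a weak 0 and a weak 1 fault on different receivers.
print(received_values(0, {0}, {2}), received_values(1, {0}, {2}))
```

When the set of misreading receivers covers the whole fanout, the line is indistinguishable from a stuck-at line, which is the sense in which stuck-at faults are degenerate weak faults.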

Breaks  in  lines  are  another  possible  source  of  weak  0  and  weak  1  faults.  A  break 
may  result  in  a  segment  of  the  line  that  is  only  connected  to  gates  of  MOS  transistors  and 
is therefore essentially "floating." The gates connected to the floating segment of the line
receive  an  incorrect  value  for  the  line  in  one  of  its  states  (0  or  1). 

A  single  break  in  the  line  can  result  in  the  line  being  stuck-at-1  if  all  the  pull-down 
devices  are  disconnected  from  the  rest  of  the  line,  and  in  the  line  s-a-0  if  all  the  pull-up 
(or  load)  devices  are  disconnected  from  the  rest  of  the  line.  Furthermore,  if  only  some  of 
the  pull-up  or  pull-down  devices  are  disconnected  from  the  line,  the  line  may  not  be  s-a-0 
or  s-a-1  but  assume  the  wrong  value  for  some  inputs  that  only  turn  on  the  disconnected 
pull-ups or pull-downs. In the worst case, in CMOS or dynamic logic circuits, a break in a
line or a transistor that is permanently off can make the output of a supposedly
combinational  logic  circuit  dependent  on  the  previous  output  rather  than  the  current 
input  alone  [27].  Such  a  fault  (called  a  stuck-open  fault)  may  escape  detection  even  if  all 
possible  input  vectors  are  used  to  test  the  circuit  [27]. 
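
A stuck-open fault of this kind can be sketched with a behavioral model of a two-input CMOS NOR gate. The class below is a hypothetical illustration: it assumes that a floating output node simply retains its previous value, and the initial stored value of 0 is arbitrary.

```python
class CmosNor2:
    """Behavioral model of a two-input CMOS NOR gate.

    If `open_pulldown_a` is True, the pull-down transistor driven by
    input `a` is permanently off (a stuck-open fault).
    """

    def __init__(self, open_pulldown_a=False):
        self.open_pulldown_a = open_pulldown_a
        self.out = 0  # charge stored on the output node (assumed initial value)

    def evaluate(self, a, b):
        pull_up = a == 0 and b == 0
        pull_down = (a == 1 and not self.open_pulldown_a) or b == 1
        if pull_up:
            self.out = 1
        elif pull_down:
            self.out = 0
        # Otherwise neither network conducts: the node floats and keeps its value.
        return self.out


def run(gate, vectors):
    """Apply a sequence of (a, b) input vectors and collect the outputs."""
    return [gate.evaluate(a, b) for a, b in vectors]


all_vectors = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(run(CmosNor2(), all_vectors))
print(run(CmosNor2(open_pulldown_a=True), all_vectors))
print(run(CmosNor2(open_pulldown_a=True), [(0, 0), (1, 0)]))
```

With the first ordering the faulty gate produces the same outputs as the fault-free gate even though every input vector is applied. The fault only shows when the vector (1, 0) directly follows (0, 0), so the response to (1, 0) depends on the previous output.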

A  short  between  adjacent  or  crossing  lines  forces  both  lines  to  have  the  same  value. 
This  value  may  lie  between  logic  0  and  logic  1.  Hence,  if  the  two  lines  are  supposed  to 
carry  complementary  values,  the  line  that  is  supposed  to  be  at  logic  1  may  have  a  weak  1 
fault  and  the  line  that  is  supposed  to  be  at  logic  0  may  have  a  weak  0  fault.  If  the  circuit 
is  designed  so  that  a  short  always  forces  both  lines  to  a  well  defined  logic  value,  this 
value  may  be  always  the  value  of  one  of  the  lines  that  "dominates"  because  it  is  driven 
with  larger  devices,  or  it  may  always  be  logic  0  (AND  operation)  or  always  logic  1  (OR 
operation). 
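
For shorts that force both lines to a well defined logic value, the three behaviors mentioned above can be written down directly. The sketch below uses hypothetical names and does not model the case where the shorted lines settle at an intermediate voltage.

```python
def short(v1, v2, kind):
    """Return the values carried by two shorted lines.

    kind -- "and": the short always yields logic 0 when the lines differ
            "or": the short always yields logic 1 when the lines differ
            "first_dominates": the first line is driven by larger devices
    """
    if kind == "and":
        value = v1 & v2
    elif kind == "or":
        value = v1 | v2
    elif kind == "first_dominates":
        value = v1
    else:
        raise ValueError(f"unknown short model: {kind}")
    return value, value


def erroneous_lines(v1, v2, kind):
    """Indices (0 or 1) of the lines whose value is changed by the short."""
    shorted = short(v1, v2, kind)
    return [i for i, (good, bad) in enumerate(zip((v1, v2), shorted)) if good != bad]


print(short(0, 1, "and"), erroneous_lines(0, 1, "and"))
print(short(0, 1, "or"), erroneous_lines(0, 1, "or"))
print(short(0, 1, "first_dominates"), erroneous_lines(0, 1, "first_dominates"))
print(erroneous_lines(1, 1, "and"))
```

In each of these models a short changes at most one of the two lines, and it has no effect while the lines are supposed to carry the same value.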

We  assume  that  if  the  fault  is  transient,  the  circuit  returns  to  its  original  physical 
structure  after  the  fault  has  disappeared.  It  is,  of  course,  possible  for  a  transient  fault 
to  cause  a  permanent  change  in  the  state  of  a  circuit  with  memory  elements.  We  assume 
that,  for  the  duration  of  the  fault  (defect),  the  effects  of  the  defect  are  deterministic  so 
that  under  identical  conditions  the  effects  of  a  particular  defect  are  always  the  same. 
Thus,  if  a  line  has  a  weak  1  fault  due  to  its  driver,  the  devices  connected  to  it  which 
misinterpret  the  logic  1  as  a  logic  0,  always  misinterpret  the  logic  1  as  a  logic  0. 

Traditionally,  the  term  single  fault  has  been  used  to  denote  an  erroneous  logic 
value  on  a  single  line  in  the  circuit.  From  the  above  it  is  clear  that  a  single  physical 
defect  may  result  in  erroneous  logic  values  on  several  lines  in  the  circuit.  Hence,  we  will 
use  the  term  single  fault  to  denote  the  effect,  at  the  logical  level,  of  a  single  physical 
defect. 

B.  A  Fault  Model  for  MOS  PLAs 

For  any  VLSI  circuit,  the  effect  of  a  single  fault  on  the  output  is  dependent  on 
layout  details  such  as  which  lines  are  adjacent,  which  lines  cross  each  other,  etc.  One  of 
the advantages of using PLAs is that their regular structure simplifies analysis of the
effects of faults on their outputs and therefore facilitates test vector generation and
determination of fault coverage. In this section we discuss how the faults discussed
above affect the operation of a two-level NOR-NOR MOS PLA. To facilitate this discussion,
a  "typical"  NMOS  PLA  is  shown  in  Fig.  1. 


Fig. 1: A Self-Testing NMOS Two-Rail Code Checker

The  most  elementary  fault  model  used  for  PLAs  includes  three  types  of 
faults [16, 19, 29]:

(I)  A  stuck-at  fault  on  an  input  line,  product  term  line,  or  output  line. 

(II)  A  short  between  two  adjacent  or  crossing  lines  that  forces  both  of  them  to  the  same 
logic  value. 

(III)  A  missing  or  extra  crosspoint  device  in  the  AND  array  or  in  the  OR  array. 

The  first  two  types  of  faults  were  explained  above  and  correspond  directly  to  physical 
defects  in  the  circuit.  The  third  type  of  faults  refers  to  faults  whose  effect  on  the 
operation  of  the  circuit  is  equivalent  to  the  effect  of  a  missing  or  extra  crosspoint 
device. This may be the result of the gate of the crosspoint device stuck-at its "off"
value (0 for NMOS, 1 for PMOS) when it should be connected to an input or product term
line,  or  connected  to  an  input  or  product  term  line  when,  by  design,  it  should  be 
permanently  held  at  its  "off"  value. 

A  missing  crosspoint  device  has  the  same  effect  as  a  device  that  always 
misinterprets  the  line  that  drives  it  as  a  logic  0  even  when  it  is  a  logic  1.  Thus,  a  missing 
crosspoint  device  fault  in  the  AND  array  is  equivalent  to  a  weak  1  fault  on  the 
corresponding  input  line  while  a  missing  crosspoint  device  fault  in  the  OR  array  is 
equivalent  to  a  weak  1  fault  on  the  corresponding  product  term  line.  Hence,  if  weak  1 
faults on input lines and product term lines are considered, there is no need to consider
missing crosspoint device faults separately.
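
The equivalence between a missing crosspoint device and a weak 1 fault can be checked on a model of a single NOR product term line. The function and parameter names below are hypothetical, and the three-input line with devices on inputs 0 and 2 is an arbitrary example.

```python
from itertools import product


def product_term(inputs, crosspoints, misread_1_as_0=frozenset()):
    """Value of a NOR product term line in the AND array.

    inputs         -- tuple of logic values on the input lines
    crosspoints    -- indices of input lines with a pull-down device on this line
    misread_1_as_0 -- devices (by input index) that interpret a logic 1 as 0,
                      i.e. devices affected by a weak 1 fault on their input line
    """
    for k in crosspoints:
        seen = 0 if k in misread_1_as_0 else inputs[k]
        if seen == 1:
            return 0  # a turned-on device pulls the line down
    return 1


designed = {0, 2}
for x in product((0, 1), repeat=3):
    missing_device = product_term(x, designed - {2})
    weak_1_fault = product_term(x, designed, misread_1_as_0={2})
    assert missing_device == weak_1_fault
print("missing device at input 2 behaves like a weak 1 fault on input 2")
```

The two faulty versions agree for every input vector, which is why missing crosspoint devices need not be listed as a separate fault type once weak 1 faults are included.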

The  above  three  fault  types  do  not  include  weak  0  and  weak  1  faults  or  breaks  in 
lines  that  are  not  equivalent  to  stuck  faults.  Since  breaks  in  lines  are  one  of  the  main 
causes  of  failures  in  VLSI  circuits  [9, 13],  it  is  clear  that  the  above  simple  fault  model 
does  not  accurately  reflect  the  possible  physical  defects  in  a  MOS  PLA. 

Some  of  the  effects  of  breaks  on  general  MOS  circuits  cannot  occur  in  PLAs  due  to 
their  structure.  This  fact  can  be  used  to  reduce  the  complexity  of  the  fault  model  that 
must  be  considered  in  analyzing  the  operation  of  PLAs  under  faults.  One  such 
simplification  relies  on  the  fact  that  input  lines  are  only  connected  to  gates  of  devices  in 
the  PLA.  A  break  in  an  input  line  causes  a  segment  of  that  line  to  "float"  and  is 
therefore  equivalent  to  a  weak  0  and/or  weak  1  fault.  Hence,  if  weak  0/1  faults  on 
input lines are taken into account, there is no need to consider breaks in input lines
separately. 

Another  important  simplification  of  the  fault  model  is  based  on  the  fact  that 
product  term  lines  and  output  lines  only  have  one  pull-up  (load)  device  and  that  this 
device  is  independent  of  the  inputs  to  the  circuit.  Every  point  in  a  product  term  or 
output  line  is  either  connected  to  the  single  pull-up  (load)  or  permanently  disconnected 
from  it  (due  to  a  break).  For  any  input,  segments  of  the  line  that  are  connected  to  the 
pull-up  are  either  set  to  logic  1  or  set  to  logic  0  by  some  pull-down  device  that  is  turned 
on  by  that  particular  input.  A  segment  of  the  line  that  is  disconnected  from  the  pull-up 
is  set  to  logic  0  by  the  first  input  that  is  supposed  to  set  it  to  0  and  stays  stuck-at-0  for 
a  long  time  thereafter.  Hence  no  state  is  preserved  on  lines  between  inputs  (clock 
phases). The troublesome faults that can convert a general combinational circuit into a
sequential  circuit  cannot  occur. 

Based  on  the  above  discussion,  a  realistic  fault  model  for  PLAs  must  include 
weak  0/1  faults  as  well  as  the  possible  effects  of  breaks  in  product  term  lines  and  output 
lines.  Specifically,  the  following  faults  must  be  considered: 

(A)  Weak  0  or  weak  1  or  both  on  one  input  line. 

(B)  A  short  between  two  adjacent  input  lines. 

(C)  Weak  0  or  weak  1  or  both  on  one  product  term  line. 

(D)  A  short  between  two  adjacent  product  term  lines. 

(E)  Weak  0  or  weak  1  or  both  on  one  output  line. 


(F)  A  short  between  two  adjacent  output  lines. 

(G)  A  short  between  an  input  line  and  a  crossing  product  term  line. 

(H)  A  short  between  a  product  term  line  and  a  crossing  output  line. 

(I)  An  extra  crosspoint  device  in  the  AND  array. 

(J)  An  extra  crosspoint  device  in  the  OR  array. 

(K)  A  break  in  a  product  term  line. 

(L)  A  break  in  an  output  line. 

III. Background and Terminology

Since  code  checkers  are  key  elements  in  computer  systems  with  on-line  error 
detection,  the  design  and  implementation  of  various  self-testing  checkers  has  been  an 
active  research  area  for  many  years.  This  section  includes  a  discussion  of  some  of  that 
work  that  is  relevant  to  self-testing  comparators  in  VLSI.  In  addition,  the  terminology 
and  notation  that  will  be  used  in  the  rest  of  the  paper  are  introduced. 

We will assume that two n-bit vectors, $A = (a_{n-1}, a_{n-2}, \ldots, a_0)$ and
$B = (b_{n-1}, b_{n-2}, \ldots, b_0)$, are to be compared. In much of the literature two-rail code
checkers rather than comparators are discussed. Given two n-bit vectors
$X = (x_{n-1}, x_{n-2}, \ldots, x_0)$ and $Y = (y_{n-1}, y_{n-2}, \ldots, y_0)$, the combined 2n-bit vector
$XY = (x_{n-1}, \ldots, x_0, y_{n-1}, \ldots, y_0)$ is a two-rail code word if $x_i = y'_i$ for all $i$ such that
$0 \le i \le n-1$ (where $y'_i$ means the complement of $y_i$). We will use $B'$ to denote an n-bit
vector whose elements are the complements of the elements of $B$, i.e.,
$B' = (b'_{n-1}, b'_{n-2}, \ldots, b'_0)$. A two-rail code checker whose input is the bit vector $AB'$ is,
effectively, a comparator of vectors $A$ and $B$. We will start out by making the
assumption  (that  will  later  be  shown  to  be  unnecessary)  that  all  the  input  bits  are 
available  in  both  complemented  and  uncomplemented  form.  Hence  there  is  no  difference 
between  the  design  of  comparators  and  two-rail  code  checkers;  and  we  will  use  the  terms 
"comparator"  and  "two-rail  code  checker"  interchangeably. 

For a given fault set $F$, a checker is said to be self-testing if for every fault $f \in F$
there is a code input that results in a noncode output (error indication) [1]. When
duplication  and  matching  is  used  for  on-line  error  detection,  we  assume  that  faults  (both 
permanent  and  transient)  do  not  occur  simultaneously  in  the  duplicated  functional 
modules  and  in  the  comparator.  If  the  first  fault  occurs  in  one  of  the  functional  modules 
and causes a permanent change in the circuit or in the state of the circuit, we assume
that the outputs from the two modules "disagree" and result in a noncode output from
the  comparator  before  any  faults  occur  in  the  comparator.  If  the  first  fault  occurs  in  the 
comparator,  we  assume  that,  before  additional  faults  can  occur  in  the  comparator  or  the 
functional  modules,  the  set  of  code  words  that  will  appear  as  inputs  to  the  comparator 
will  be  sufficient  to  achieve  a  complete  self-test  of  the  comparator.  Depending  on  the 
particular  implementation  of  the  comparator,  this  set  of  code  words  may  have  to  include 
all  possible  code  inputs.  If  the  fault  persists  for  this  duration  and  the  comparator  is 
self-testing,  its  output  will  indicate  an  error  before  a  fault  in  one  of  the  functional 
modules  can  lead  to  an  undetected  erroneous  functional  output  from  the  system.  Based 
on  these  assumptions,  the  comparator  only  needs  to  be  self-testing  with  respect  to  a 
fault  set  F  that  includes  all  single  faults,  as  defined  in  Section  II. 
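
The definition of self-testing, and its dependence on which code words actually appear at the checker inputs, can be phrased as a small search. The sketch below is a deliberately trivial toy and not one of the designs discussed in this paper: it assumes a one-bit two-rail checker modeled as two wires, with stuck-at faults on its two output lines as the fault set.

```python
CODE_OUTPUTS = {(0, 1), (1, 0)}


def checker(x, y, fault=None):
    """One-bit two-rail checker modeled as two wires: (c1, c0) = (x, y).

    `fault` is None or a pair (line, value) that forces the output line
    named 'c1' or 'c0' to the given value (a stuck-at fault).
    """
    out = {"c1": x, "c0": y}
    if fault is not None:
        line, value = fault
        out[line] = value
    return out["c1"], out["c0"]


def is_self_testing(faults, code_inputs):
    """True if every fault gives a noncode output for at least one code input."""
    return all(
        any(checker(x, y, f) not in CODE_OUTPUTS for x, y in code_inputs)
        for f in faults
    )


STUCK_FAULTS = [(line, value) for line in ("c1", "c0") for value in (0, 1)]

print(is_self_testing(STUCK_FAULTS, [(0, 1), (1, 0)]))  # all code inputs appear
print(is_self_testing(STUCK_FAULTS, [(0, 1)]))          # only one code input appears
```

With both code inputs available every stuck-at fault eventually produces a noncode output. If only one code input ever appears, some faults stay hidden, which is why the set of code words seen during operation must be sufficient for a complete self-test.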

Pioneering  work  in  the  field  of  self-testing  checkers  was  reported  by  Carter  and 
Schneider [6]  whose  design  of  a  self-testing  two-rail  code  checker  serves  as  a  basis  for 
the  comparator  discussed  in  this  paper.  For  the  case  n  =  2  ,  Carter  and  Schneider 
presented  a  design  of  a  circuit  that  checks  whether  the  input  is  a  two-rail  code  word  and 
that  is  also  self-testing  with  respect  to  any  single  stuck-at  fault  [6].  The  circuit,  shown  in 
Fig. 2, has two output lines $c_1$ and $c_0$ where $(c_1,c_0) = (0,1)$ or $(c_1,c_0) = (1,0)$ for code
input, and $(c_1,c_0) = (0,0)$ or $(c_1,c_0) = (1,1)$ for noncode input.


Fig. 2: A Self-Testing Two-Rail Code Checker [6]

Carter  and  Schneider's  checker  has  the  property  that,  with  no  faults,  every  line  in 
the  circuit  is  0  for  at  least  one  code  input  and  1  for  at  least  one  code  input.  If  any  line  is 
stuck-at-0  (s-a-0)  or  s-a-1,  the  code  input  for  which  the  line  is  supposed  to  be  at  1  or  0, 
respectively, results in the output (0,0) or (1,1).

Wang and Avizienis [29] extended Carter and Schneider's design to arbitrary size
input vectors. For each one of the $2^n$ input code words there is a single unique product
term that is selected (set to 1) only by that code word. Each product term line selects
exactly one of the two output lines. Depending on the parity of the vector
$A = (a_{n-1}, a_{n-2}, \ldots, a_0)$, half the code inputs select output $c_0$ and the other half select
output $c_1$.

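The checker proposed by Wang and Avizienis can be described by sum-of-products
equations as follows: For any integer $k$, let $I_k$ denote the set of the $k$ integers
between 0 and $k-1$, i.e., $I_k = \{0, 1, \ldots, k-2, k-1\}$. If $Q$ is a set, let $|Q|$ denote the
number of elements in $Q$.

$$c_0 = \sum_{\{Q \,\mid\, Q \subseteq I_n \text{ and } |Q| \text{ is even}\}} \Big[ \prod_{i \in Q} a_i \Big] \Big[ \prod_{j \in (I_n - Q)} b'_j \Big]$$

$$c_1 = \sum_{\{Q \,\mid\, Q \subseteq I_n \text{ and } |Q| \text{ is odd}\}} \Big[ \prod_{i \in Q} a_i \Big] \Big[ \prod_{j \in (I_n - Q)} b'_j \Big] \qquad (1)$$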

In NOR-NOR form similar functionality can be achieved based on Equation (2).
An  NMOS  PLA  which  implements  these  equations  for  the  case  n  =  2  is  shown  in  Fig.  1.  It 
should  be  noted  that  there  are  a  total  of  2n  input  bits  to  this  circuit:  all  the  "a"  bits 
uncomplemented  and  all  the  "b"  bits  complemented.  Each  "product  term"  contains 
exactly  n  literals. 

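$$c_0 = \mathop{\mathrm{NOR}}_{\{Q \,\mid\, Q \subseteq I_n \text{ and } |Q| \text{ is odd}\}} \Big[ \mathrm{NOR}\big( \{a_i \mid i \in Q\} \cup \{b'_j \mid j \in (I_n - Q)\} \big) \Big]$$

$$c_1 = \mathop{\mathrm{NOR}}_{\{Q \,\mid\, Q \subseteq I_n \text{ and } |Q| \text{ is even}\}} \Big[ \mathrm{NOR}\big( \{a_i \mid i \in Q\} \cup \{b'_j \mid j \in (I_n - Q)\} \big) \Big] \qquad (2)$$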

In  Section  V  it  is  shown  that  the  circuit  described  by  Equation  (2)  is,  in  fact,  a 
comparator.  In  Sections  VI  and  VII  it  is  shown  that  the  circuit  can  be  implemented  so 
that  it  is  self-testing  with  respect  to  the  fault  model  presented  in  Section  II. 
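
For small n, the comparator property of the NOR-NOR circuit can also be checked exhaustively. The sketch below is an illustration rather than the proof of Section V. It assumes that Equation (2) has one product term per subset Q of the bit positions, formed as the NOR of the literals a_i for i in Q and b'_j for j not in Q, and that terms with odd |Q| drive c0 while terms with even |Q| drive c1; exchanging the two parities merely swaps the roles of the outputs.

```python
from itertools import combinations, product


def nor(values):
    """NOR of an iterable of logic values."""
    return 0 if any(values) else 1


def subsets(n):
    """Yield every subset Q of {0, ..., n-1}."""
    for size in range(n + 1):
        for q in combinations(range(n), size):
            yield set(q)


def comparator(a, b_comp):
    """Evaluate the NOR-NOR checker on A and B'; return (c1, c0)."""
    n = len(a)
    terms = {0: [], 1: []}
    for q in subsets(n):
        literals = [a[i] for i in q] + [b_comp[j] for j in range(n) if j not in q]
        output = 0 if len(q) % 2 == 1 else 1  # assumed parity convention
        terms[output].append(nor(literals))
    return nor(terms[1]), nor(terms[0])


def is_comparator(n):
    """Check every pair of n-bit vectors A and B."""
    for a, b in product(product((0, 1), repeat=n), repeat=2):
        b_comp = tuple(1 - bit for bit in b)
        out = comparator(a, b_comp)
        expected = {(0, 1), (1, 0)} if a == b else {(0, 0), (1, 1)}
        if out not in expected:
            return False
    return True


print(all(is_comparator(n) for n in range(1, 5)))
```

Under these assumptions equal vectors always produce (0,1) or (1,0), and unequal vectors always produce (0,0) or (1,1).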

In  the  literature,  checkers  which  are  fault-secure  as  well  as  self-testing  have  often 
been discussed [1, 17]. A circuit is said to be fault-secure if, for every fault in the
prescribed fault set, the circuit never produces an incorrect code output for code
inputs [1]. A checker only provides one bit of information to the rest of the system
regarding its inputs; the checker's output is code or noncode depending on whether its
input is code or noncode, respectively. As long as this binary distinction is performed
correctly, it does not matter exactly which code output (or which noncode output) is
produced  by  the  checker.  Thus,  the  concept  of  fault-secure  checkers  is  meaningless  [23] 
and  the  fault-secure  property  will  not  be  considered  further  in  this  paper. 

IV. Optimal Design of Self-Testing Comparators Using Two-Level Logic

Published work on self-testing checkers usually consists of a circuit design and a
proof that the circuit is self-testing. There has been no attempt to show that the
proposed designs are optimal in any respect. The self-testing comparator design
proposed by Wang and Avizienis requires $2^n$ product terms for comparing n-bit vectors.
However, it is possible to implement a comparator that has two outputs that are (0,1) or
(1,0) for code inputs and (1,1) for noncode inputs based on the equations:

$$c_0 = a'_0 + b'_0 + \sum_{i=1}^{n-1} (a_i b'_i + a'_i b_i) \qquad c_1 = a_0 + b_0 + \sum_{i=1}^{n-1} (a_i b'_i + a'_i b_i)$$

This  comparator  is  self-testing  with  respect  to  faults  in  the  input  lines  and  output  lines 
but requires only 4n product terms. However, this comparator is not self-testing with
respect to stuck faults on the product term lines. The question thus arises whether $2^n$
is  the  minimum  number  of  product  terms  necessary  for  a  comparator  that  is  self-testing 
with  respect  to  a  realistic  fault  model  which  also  takes  into  account  faults  affecting  the 
product  term  lines. 
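
The weakness of the 4n-term comparator can be reproduced with a small AND-OR model. The sketch below is illustrative only: it assumes that each output has its own 2n product term lines (two single-literal terms for bit 0 plus the two mismatch terms for each remaining bit), and the data representation and function names are hypothetical.

```python
from itertools import product


def build_terms(n):
    """Product terms per output; a literal is (vector, bit index, required value)."""
    mismatch = []
    for i in range(1, n):
        mismatch.append([("a", i, 1), ("b", i, 0)])  # a_i b'_i
        mismatch.append([("a", i, 0), ("b", i, 1)])  # a'_i b_i
    c0 = [[("a", 0, 0)], [("b", 0, 0)]] + mismatch   # a'_0 + b'_0 + ...
    c1 = [[("a", 0, 1)], [("b", 0, 1)]] + mismatch   # a_0 + b_0 + ...
    return {"c0": c0, "c1": c1}


def evaluate(terms, a, b, stuck_at_0=None):
    """Return (c1, c0); `stuck_at_0` names one (output, term index) line held at 0."""
    vec = {"a": a, "b": b}
    outputs = {}
    for name, term_list in terms.items():
        value = 0
        for idx, term in enumerate(term_list):
            if stuck_at_0 == (name, idx):
                continue
            if all(vec[v][i] == want for v, i, want in term):
                value = 1
        outputs[name] = value
    return outputs["c1"], outputs["c0"]


def undetected_stuck_at_0(n):
    """Product term lines whose stuck-at-0 fault no code input (A = B) reveals."""
    terms = build_terms(n)
    code_inputs = [(a, a) for a in product((0, 1), repeat=n)]
    return [
        (name, idx)
        for name, term_list in terms.items()
        for idx in range(len(term_list))
        if all(evaluate(terms, a, b, (name, idx)) == evaluate(terms, a, b)
               for a, b in code_inputs)
    ]


n = 3
terms = build_terms(n)
print(sum(len(t) for t in terms.values()))  # number of product term lines
print(len(undetected_stuck_at_0(n)))        # stuck-at-0 faults never revealed
```

In this model none of the stuck-at-0 faults on product term lines is revealed by a code input: the mismatch terms are never selected when A = B, and the two single-literal terms feeding each output mask each other.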

Since  the  comparator  must  be  self-testing  with  respect  to  stuck-at  faults  on  the 
output  lines,  it  must  have  at  least  two  output  lines  [6].  The  use  of  more  than  two  lines 
has  been  proposed  [23];  however,  since  limited  communication  bandwidth  is  a  severe 
problem  in  VLSI  systems,  it  is  preferable  to  minimize  the  bandwidth  dedicated  to 
transmitting  self-testing  information.  Hence  we  will  only  consider  comparators  with  two 
output  lines. 

There are two possible ways to "code" the output from the comparator and still
allow  self-testing  of  the  output  lines:  (A)  The  code  output  is  (0,1)  or  (1,0)  and  the 
noncode  (error  indication)  output  is  (0,0)  or  (1,1).  (B)  The  code  output  is  (0,0)  or  (1,1) 
and  the  noncode  (error  indication)  output  is  (0,1)  or  (1,0).  Option  (A)  is  preferable  since 
it allows the comparator to be self-testing with respect to shorts between the output
lines as well as any other fault that causes a unidirectional error. A unidirectional error
means that, due to a fault, some lines that are supposed to be at logic 0 are at logic 1 or
some of the lines that are supposed to be at logic 1 are at logic 0, but not both. It has
been shown that the faults that are most likely to occur in PLAs (fault types (I), (II), and
(III) in Section II) can result only in a unidirectional error [16]. Therefore only
comparators  with  the  option  (A)  encoding  of  the  outputs  will  be  considered. 
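
The advantage of option (A) with respect to shorts between the two output lines can be seen in a few lines. The sketch assumes that a short forces both outputs to the AND or to the OR of their fault-free values; the names are hypothetical.

```python
OPTION_A_CODE = {(0, 1), (1, 0)}
OPTION_B_CODE = {(0, 0), (1, 1)}


def short_outputs(pair, kind):
    """Values on the two output lines when they are shorted together."""
    c1, c0 = pair
    value = (c1 & c0) if kind == "and" else (c1 | c0)
    return value, value


def short_is_detected(code_outputs, kind):
    """True if some fault-free code output becomes a noncode output under the short."""
    return any(short_outputs(pair, kind) not in code_outputs for pair in code_outputs)


for kind in ("and", "or"):
    print(kind, short_is_detected(OPTION_A_CODE, kind), short_is_detected(OPTION_B_CODE, kind))
```

Under option (A) every code output has complementary lines, so a short turns it into (0,0) or (1,1) and is flagged. Under option (B) the code outputs already have equal lines, so the short never changes a code output and would only show up by turning noncode outputs into code outputs.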


Although  the  design  of  a  self-testing  comparator  presented  by  Wang  and 
Avizienis [29] uses $2^n$ product terms, one for each code input, this is never shown to be
a necessary property of self-testing comparators implemented with PLAs. In several
papers [15, 29] it is claimed that it is "desirable" to use PLAs that are nonconcurrent, i.e.,
where  each  code  input  selects  only  one  product  term.  Wang  and  Avizienis  propose  a 
general  approach  to  the  design  of  self-testing  PLAs  that  always  results  in  a 
nonconcurrent  circuit.  They  also  give  an  example  of  a  PLA  where  concurrency  leads  to  a 
circuit  which  is  not  self-testing  [29].  However,  nonconcurrency  is  not  presented  as  a 
necessary  property  nor  is  there  any  mention  of  a  problem  with  product  terms  that  are 
selected  by  more  than  one  code  input. 

In  this  section  it  is  shown  that  the  exponential  growth  in  the  number  of  product 
term  lines  is  indeed  necessary  for  self-testing.  For  any  two-level  NOR-NOR 
implementation,  it  is  shown  that  every  code  input  must  select  exactly  one  product  term 
line  and  that  no  two  different  code  inputs  can  select  the  same  product  term  line.  This  is 
necessary  even  if  the  only  faults  considered  are  single  stuck-at  faults  on  the  input, 
output  and  product  lines.  The  proof  that  the  same  requirement  also  applies  to  two-level 
AND-OR implementations is almost identical and will not be presented here.

Lemma 1: Every product term must be selected (set to 1) by at least one code word.

Proof: Assume that there is a product term that is not selected by any code word.
A stuck-at-0 fault on this product term line will not be detected during normal operation,
thus violating the self-testing requirement.

Lemma  2:  Every  code  word  must  select  at  least  one  product  term. 

Proof: If there is some code word that does not select any product term, the
comparator output for that code word will be the noncode output (1,1), which incorrectly
signals an error.

Lemma 3: All the product terms selected by a single code word must be connected to
the same single output in the OR array.

Proof: If any of the product terms selected by a code word is connected to both
outputs in the OR array, then, for that code word, the output will be the noncode output
(0,0), which incorrectly signals an error. Similarly, the output will be (0,0) if the product
terms are not all connected to the same output line.

Lemma  4:  No  product  term  can  be  selected  by  more  than  one  code  word. 

Proof: By contradiction. Assume that $P_i$ is a product term which is selected by the
two code inputs $AA = (a_{n-1}, \ldots, a_0, a_{n-1}, \ldots, a_0)$ and $BB = (b_{n-1}, \ldots, b_0, b_{n-1}, \ldots, b_0)$.
Since the two code words are different, there exists an integer $k$ $(0 \le k \le n-1)$ such
that $a_k \ne b_k$. $P_i$ is selected only if all the literals in the expression corresponding to
$P_i$ are 0. Since $P_i$ is selected by both code words, it must be independent of bit $k$
from the two functional modules. Hence $P_i$ is also selected by the code input
$WW = (a_{n-1}, \ldots, a'_k, \ldots, a_0, a_{n-1}, \ldots, a'_k, \ldots, a_0)$ and by the noncode input
$Q = (a_{n-1}, \ldots, a'_k, \ldots, a_0, a_{n-1}, \ldots, a_k, \ldots, a_0)$.

Since $Q$ is a noncode input, the corresponding output produced by the comparator
must be noncode. When $P_i$ is selected, it sets to 0 the one output line it is connected to.
Hence, $Q$ must select another product term, $P_j$, connected to the other output line, so
that the noncode output (0,0) will be produced. By Lemma 1, $P_j$ must also be selected
by at least one code word $CC = (c_{n-1}, \ldots, c_k, \ldots, c_0, c_{n-1}, \ldots, c_k, \ldots, c_0)$.
Since in $CC$, bit $k$ from both functional modules is the same, and in $Q$ bit $k$ from one
unit is the complement of bit $k$ from the other unit, $P_j$ must not include the literal
corresponding to bit $k$ from at least one of the two functional modules. Hence, since
$Q$ selects $P_j$, either $AA$ or $WW$ must also select $P_j$. Without loss of generality,
assume $WW$ selects $P_j$. From the above, $WW$ also selects $P_i$. But in the OR array $P_j$
is connected to a different output line from $P_i$. Hence, Lemma 3 above is "violated" and
the code word $WW$ results in the noncode output (0,0). Thus the assumption that there
exists a product term that is selected by more than one code input must be incorrect.

Lemma 5: Every code word must select one, and only one, product term.

Proof: By Lemma 2, every code word must select at least one product term. Assume
that the code word $AA = (a_{n-1}, \ldots, a_0, a_{n-1}, \ldots, a_0)$ selects the two product terms $P_i$
and $P_j$. By Lemma 4, no other code word except $AA$ can select $P_i$ or $P_j$. Hence, a
stuck-at-0 fault on the $P_i$ or $P_j$ lines can only be detected by the input $AA$. By
Lemma 3, both $P_i$ and $P_j$ must be connected to the same output line in the OR array.
Hence, when the code word $AA$ is applied, the output from the PLA will be the same
whether or not one of the product term lines $P_i$ or $P_j$ is stuck-at-0. Thus a stuck-at-0
fault on one of the product term lines $P_i$ or $P_j$ will not be detectable by any code word,
thus violating the self-testing requirement.

Theorem 1: A self-testing comparator of two n-bit vectors that has two output lines and
is implemented as a two-level NOR-NOR PLA must have exactly $2^n$ product terms.

Proof: By Lemma 1, every product term line is selected by at least one code word.
By Lemma 5, every code word selects one, and only one, product term line. Hence, the
number of product term lines is equal to the number of code words. Since there are n
bits of output from each one of the two functional modules, there are $2^n$ code words.
Therefore the number of product term lines is exactly $2^n$.

Q.E.D. 

Any comparator of two n-bit vectors must have at least 2n input lines (two lines for
every pair of bits being compared). As previously discussed, at least two output lines are
necessary. Based on the proof presented in this section, exactly $2^n$ product term lines
are necessary for any two-level NOR-NOR implementation. Hence, the design based on
Equation (2), which was discussed in the previous section, is optimal. In the next three
sections a PLA implementation of a self-testing comparator based on this design is
analyzed in detail.
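
The counting argument can be checked by brute force for small n. The following Python
sketch is an illustration only and is not part of the original report. It assumes the
product term structure used in the proofs of Section V: one NOR term per subset Q of bit
positions, containing the literal a_i for every i in Q and the literal b'_j for every j
not in Q. The function names and the bitmask encoding of Q are hypothetical.

```python
from itertools import product

def selected_terms(a, b_prime):
    """Return the subsets Q (as bitmasks) whose NOR product term is selected.

    Assumed term structure: the term for Q contains literal a_i for i in Q
    and literal b'_j for j not in Q; it is selected when all literals are 0.
    """
    n = len(a)
    hits = []
    for q in range(2 ** n):
        literals = [a[i] if (q >> i) & 1 else b_prime[i] for i in range(n)]
        if not any(literals):
            hits.append(q)
    return hits

def code_word_map(n):
    """Map every code word (A = B) to the list of product terms it selects."""
    mapping = {}
    for a in product((0, 1), repeat=n):
        b_prime = tuple(1 - bit for bit in a)  # B = A, so b' is the complement of a
        mapping[a] = selected_terms(a, b_prime)
    return mapping

mapping = code_word_map(3)
```

For n = 3 every code word maps to a single term and every one of the 8 terms is used,
which is the one-to-one correspondence that Lemmas 1 and 5 require.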

V. FAULT-FREE OPERATION OF THE COMPARATOR

In the previous section we proved that any self-testing comparator implemented as
a single two-level NOR-NOR PLA must have $2^n$ product terms. In this section and the two
subsequent sections we discuss a specific self-testing comparator based on Equation (2)
in Section III which satisfies this necessary property.

Although a comparator based on Equation (2) has been discussed in the literature,
we could find no rigorous proof that it indeed functions as a comparator. For
completeness, we present such a proof in this section. To prove that, with no faults, the
circuit described by Equation (2) is a comparator, we first show that if $A = B$, the
output is (0,1) or (1,0). Then we show that if $A \ne B$, the output is (0,0) or (1,1).

If $A = B$, there are exactly n ones and n zeros at the inputs. If $U$ is the set of
integers $U = \{ i \mid a_i = 0 \}$, then for every integer $j$ such that $j \in (I_n - U)$,
$a_j = b_j = 1$. Thus, $b'_j = 0$, and the one product term that corresponds to $Q = U$ in
Equation (2) is selected. Every other product term includes the literal $a_j$ for some
$j \in (I_n - U)$ or $b'_i$ for some $i \in U$. Hence, all of the other product terms are set
to 0. Thus, only the one output line connected to the single selected product term is set
to 0, and the output is (0,1) or (1,0).

If $A \ne B$, the two bit-vectors differ by at least one bit. Consider the product term

$\mathrm{NOR}\,( a_i, b'_j \mid i \in Q,\ j \in I_n - Q )$    (3)

for some $Q \subseteq I_n$. Assume that A and B differ in bit $r$, $r \in I_n$, with $a_r = 1$
and $b'_r = 1$ in the input AB. If $r \in Q$, the product term is set to 0 since it contains
the literal $a_r$. If $r \notin Q$, the product term is set to 0 since it contains the literal
$b'_r$. Hence, all of the product terms are set to 0 and the output is (1,1).

Since the two bit-vectors differ by at least one bit, if there does not exist any
integer $r \in I_n$ such that $a_r = 1$ and $b'_r = 1$, there must exist an integer $s \in I_n$
such that $a_s = 0$ and $b'_s = 0$ in the input AB. If AB does not select any product term,
the output is (1,1). Assume that the product term that corresponds to $Q = Q_1$
(Equation (3)) is selected. If $s \in Q_1$, consider the set $Q_2 = Q_1 - \{s\}$. Since
$Q_2 \subset Q_1$, $I_n - Q_2 = (I_n - Q_1) + \{s\}$, and $b'_s = 0$, the product term that
corresponds to $Q = Q_2$ will also be selected. If $s \notin Q_1$, consider the set
$Q_3 = Q_1 + \{s\}$. Since $a_s = 0$ and $I_n - Q_3 \subset I_n - Q_1$, the product term that
corresponds to $Q = Q_3$ will also be selected. Thus, either the product terms
corresponding to $Q_1$ and $Q_2$ will be selected, or the product terms corresponding to
$Q_1$ and $Q_3$ will be selected. The number of elements in $Q_1$ is one greater than the
number of elements in $Q_2$ and is one less than the number of elements in $Q_3$. Hence,
either $|Q_2|$ and $|Q_3|$ are even while $|Q_1|$ is odd, or $|Q_2|$ and $|Q_3|$ are odd
while $|Q_1|$ is even. Thus, the product terms corresponding to $Q_2$ and $Q_3$ are
connected to the same output line, which is different from the output line to which the
product term corresponding to $Q_1$ is connected. Therefore, product terms connected to
both output lines are always selected and the output is (0,0).
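
The two cases of the proof can be exercised exhaustively for a small n. The sketch below
is an illustration and not the authors' implementation. It assumes that product terms
with odd |Q| drive output c0 and terms with even |Q| drive c1, which agrees with the
statement in Section VII.E that an odd number of zeros among the a_i gives
(c1, c0) = (1,0). The function names are hypothetical.

```python
from itertools import product

def comparator(a, b):
    """Return (c1, c0) for the fault-free NOR-NOR comparator sketch."""
    n = len(a)
    b_prime = [1 - bit for bit in b]
    odd_selected = even_selected = False
    for q in range(2 ** n):
        # Term for subset q: literals a_i for i in q, b'_j for j not in q.
        literals = [a[i] if (q >> i) & 1 else b_prime[i] for i in range(n)]
        if not any(literals):                 # NOR term selected
            if bin(q).count("1") % 2:
                odd_selected = True
            else:
                even_selected = True
    # Assumption: odd-|Q| terms drive c0 and even-|Q| terms drive c1 (NOR).
    return (int(not even_selected), int(not odd_selected))

def classify(n):
    """Collect the set of outputs seen for equal and for unequal inputs."""
    equal, unequal = set(), set()
    for a in product((0, 1), repeat=n):
        for b in product((0, 1), repeat=n):
            (equal if a == b else unequal).add(comparator(a, b))
    return equal, unequal

equal_outputs, unequal_outputs = classify(3)
```

With tuples indexed from bit 0, `comparator((1, 0, 0), (0, 0, 0))` falls under the first
case (some a_r = 1 with b'_r = 1, so no term is selected), while
`comparator((0, 0, 0), (1, 0, 0))` falls under the second case (two terms of opposite
parity are selected).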

VI.  IDENTIFICATION  AND  ELIMINATION  OF  UNDETECTABLE  FAULTS 

Given that the circuit described by Equation (2) functions as claimed when it is
fault-free, it remains to be shown that the circuit is self-testing with respect to any
single fault in the fault model described in Section II. Specifically, it must be shown that
for any such fault there exists a code input that results in a noncode output (0,0) or
(1,1) from the comparator. In this section it is shown that there are a few faults in the
fault model with respect to which the circuit is not self-testing. We refer to these
problematic faults as undetectable faults. Layout guidelines that prevent these faults
from occurring in the actual circuit are discussed.

A. A Short Between Adjacent Product Term Lines

One  of  the  possible  faults  is  a  short  between  two  adjacent  product  term  lines  that 
forces  both  of  the  lines  to  logic  1  when  they  are  supposed  to  be  carrying  different  values 
(fault  type  (D)).  If  the  two  product  term  lines  are  connected  to  the  same  output  line, 
there  is  no  code  input  that  results  in  a  noncode  output.  In  fact,  the  circuit  continues  to 
function  correctly  despite  this  fault.  The  reason  for  this  is  that  if  one  of  the  product 
term  lines  connected  to  an  output  line  is  selected,  that  output  line  is  set  to  logic  0 
regardless  of  the  value  of  any  other  product  term  connected  to  it.  It  is  undesirable  to 
allow  this  fault  to  remain  undetected  since  the  situation  may  deteriorate  in  time  and 
intermittently  cause  weak  0  or  weak  1  faults  that  will  not  be  detected  and  will  later 
combine  with  an  additional  fault  to  cause  more  serious  undetectable  faults. 

As  indicated  by  Wang  and  Avizienis  [29],  the  possibility  that  this  undetectable  fault 
will  occur  can  be  eliminated  by  ensuring  that  product  term  lines  connected  to  the  same 
output  line  are  not  adjacent.  Since  the  same  number  of  product  term  lines  are 
connected  to  each  output  line,  this  guideline  is  easy  to  obey  and  incurs  no  penalty  in 
terms  of  the  size  or  performance  of  the  circuit.  The  guideline  is  satisfied  by  simply 
alternating  between  product  term  lines  connected  to  one  output  line  and  those 
connected  to  the  other  line. 
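
One way to picture the alternating arrangement is to interleave the terms by the parity
of |Q|. The following sketch is only an illustration under the assumption that each
product term is identified by a bitmask Q and that the parity of |Q| decides which output
line the term drives; the function name is hypothetical and the report does not prescribe
a particular ordering.

```python
def alternating_order(n):
    """Order the 2**n product terms so that neighbours drive different output lines."""
    evens = [q for q in range(2 ** n) if bin(q).count("1") % 2 == 0]
    odds = [q for q in range(2 ** n) if bin(q).count("1") % 2 == 1]
    order = []
    # For n >= 1 there are exactly 2**(n-1) terms of each parity, so zip loses none.
    for even_term, odd_term in zip(evens, odds):
        order.extend([even_term, odd_term])
    return order

order = alternating_order(3)
```

Any ordering in which neighbouring terms have opposite parity would serve equally well;
a Gray-code sequence of the bitmasks is another such ordering.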

B.  A  Short  Between  a  Product  Term  Line  and  an  Output  Line 

Another potentially undetectable fault is a short between a product term line, $P_i$,
and an output line, $c_m$, where there is no device at the crosspoint of the two lines. This
fault is undetectable if whenever the two lines are supposed to carry a different logic
value, they are both forced to logic 1.

The short between $P_i$ and $c_m$ is not detectable since the faulty circuit will behave
as follows: For the code input XX that is supposed to select $P_i$, $P_i$ is supposed to be
at logic 1 and $c_m$ at logic 1 (since the other output line, $c_{\bar{m}}$, is supposed to be at
logic 0). Hence there is no change in the output from the circuit. On the other hand, $P_i$
is supposed to be at logic 0 and $c_m$ is supposed to be at logic 1 for every code input,
YY, such that YY $\ne$ XX and the number of $a_i$ ($i \in I_n$) inputs that are at logic 0 in
YY has the same parity as the number of $a_i$ inputs that are at logic 0 in XX. For
these code inputs, $P_i$ is forced to logic 1 but this has no effect on $c_{\bar{m}}$, which is
supposed to be at logic 0. For the remaining $2^{n-1}$ code inputs, $P_i$ is supposed to be at
logic 0 and $c_m$ is supposed to be at logic 0. Hence there is no change in the output from
the circuit.
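
The case analysis above can be replayed with a small behavioural model. The sketch below
is a simplification written for illustration, not a circuit-level simulation: it assumes
that odd-|Q| terms drive c0 and even-|Q| terms drive c1, and that the short forces both
lines to logic 1 whenever their fault-free values differ. All names are hypothetical.

```python
from itertools import product

def parity(q):
    return bin(q).count("1") % 2

def outputs(a, short=None):
    """Return (c1, c0) for the code input A = B = a.

    short = (q, m) models a short between product term q and output line c_m
    under the assumption that, whenever the two lines should differ, both are
    forced to logic 1.
    """
    n = len(a)
    b_prime = [1 - bit for bit in a]
    term = {q: int(not any(a[i] if (q >> i) & 1 else b_prime[i] for i in range(n)))
            for q in range(2 ** n)}

    def line(m):
        # Assumed wiring: odd-|Q| terms drive c0, even-|Q| terms drive c1.
        return int(not any(v for q, v in term.items() if parity(q) != m))

    c = [line(0), line(1)]
    if short is not None:
        q, m = short
        if term[q] != c[m]:
            term[q] = 1               # product term line pulled to logic 1
            c = [line(0), line(1)]    # the term may now pull its own output low
            c[m] = 1                  # shorted output line pulled to logic 1
    return (c[1], c[0])

def undetected_shorts(n):
    """Shorts (q, m) without a crosspoint device that no code input detects."""
    found = []
    for q in range(2 ** n):
        m = parity(q)                 # term q drives the other line, not c_m
        if all(outputs(a, (q, m)) == outputs(a)
               for a in product((0, 1), repeat=n)):
            found.append((q, m))
    return found

undetected = undetected_shorts(3)
```

Under these assumptions every one of the 8 possible shorts for n = 3 leaves all code
outputs unchanged, in agreement with the argument in the text.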

The short between $P_i$ and $c_m$ can be made detectable if it is possible to ensure
that when $P_i$ is at logic 0, it forces $c_m$ to logic 0 as well. In NMOS, this can be done by
using  large  crosspoint  devices  in  the  AND  array  so  that  a  single  device  can  pull  down  two 
load  devices  —  the  output  line  pull-up  as  well  as  the  product  term  line  pull-up.  In  CMOS, 
this  can  be  done  by  using  large  crosspoint  devices  in  the  AND  array  so  that  a  single 
device  can  discharge  the  precharged  output  line  and  product  term  line  together  within 
the  circuit’s  clock  period.  Unfortunately,  larger  AND  array  crosspoint  devices  lead  to  a 
larger  PLA  that  is  also  slower  due  to  larger  capacitances. 

It is possible that due to a short between a product term line, $P_i$, and an output
line, $c_m$, both lines always assume the value at which $c_m$ is supposed to be. Using
arguments similar to the above, it can be shown that, in this case, the short is
undetectable regardless of whether or not there is a device at the crosspoint of the two
lines [25]. This short can also be made detectable by using large AND array crosspoint
devices that ensure that when $P_i$ is at logic 0, it forces $c_m$ to logic 0 as well.

C.  Shorts  Resulting  in  Simultaneous  Weak  0  and  Weak  1  Faults 

In  this  subsection  we  consider  the  possibility  that,  due  to  a  short,  two  lines  that  are 
supposed  to  carry  complementary  values  are  both  forced  to  a  value  between  logic  0  and 
logic  1.  The  result  is  a  weak  0  fault  on  one  of  the  lines  and  a  weak  1  fault  on  the  other 
line.  Such  shorts  may  be  undetectable  by  any  code  input.  To  show  that  the  circuit  is  not 
self-testing  with  respect  to  such  a  short,  it  is  sufficient  to  show  that  the  fault  is 
undetectable  under  the  worst  possible  combination  of  devices  that  misinterpret  the 
values  on  the  lines. 

1) A Short Between Adjacent Product Term Lines: As discussed in Subsection A,
adjacent product term lines should be connected to different output lines. If a short
between two product term lines, $P_i$ and $P_j$, forces both to a value between logic 0 and
logic 1 when they are supposed to be carrying different values, this fault may be
undetectable. For a code input XX, the short can affect the output only if XX is
supposed to select either $P_i$ or $P_j$. Without loss of generality, assume that XX is
supposed to select $P_i$. All other product term lines (including $P_j$) are not supposed to
be selected by XX. However, a short between $P_i$ and $P_j$ can cause the OR array
device connected to $P_i$ to misinterpret it as logic 0 and the device connected to $P_j$ to
misinterpret it as logic 1. Hence, despite the fault, only one product term line ($P_j$) is
interpreted as being selected and the output from the circuit is a code output. Thus,
this short is not detected by any code input.

It  can  be  shown  that  there  exists  a  noncode  input  that,  due  to  the  short  between 
product  term  lines,  results  in  code  output.  Hence  this  short,  that  is  not  detectable  by 
code  inputs,  can  mask  noncode  inputs.  Thus,  the  PLA  should  be  laid  out  in  such  a  way 
that  either  this  short  cannot  occur,  or  if  it  does  occur,  both  lines  are  guaranteed  to  be 
forced  to  the  same  logic  value  instead  of  some  value  between  logic  0  and  logic  1. 

We  have  already  shown  that  the  crosspoint  devices  in  the  AND  array  should  be  made 
large  enough  so  that  they  can  pull  down  both  the  product  term  line  and  an  output  line 
that  it  may  be  shorted  to.  Assuming  that  the  same  pull-ups  are  used  for  the  product 
term  lines  and  the  output  lines,  each  crosspoint  device  in  the  AND  array  is  also  able  to 
pull  down  two  product  term  lines.  Hence,  a  short  between  two  product  term  lines  is 
guaranteed  to  force  them  both  to  logic  0  when  they  are  supposed  to  be  carrying 
complementary values. It will be shown in Section VII that this ensures that the short
can  be  detected  by  some  code  input. 

2) A Short Between Adjacent Input Lines: A short between adjacent input lines may
also be undetectable by any code input if, whenever the lines are supposed to be carrying
complementary values, both lines are forced to a value between logic 0 and logic 1.
Consider a short between two adjacent input lines $a_h$ and $a_j$ ($h \ne j$). There exists a
code input $XX = (x_{n-1}, \ldots, x_0, x'_{n-1}, \ldots, x'_0)$ for which $a_h$ is supposed to be at
logic 0 and $a_j$ is supposed to be at logic 1. Clearly $x_h = 0$ and $x_j = 1$, so

$XX = (x_{n-1}, \ldots, x_{h+1}, 0, x_{h-1}, \ldots, x_{j+1}, 1, x_{j-1}, \ldots, x_0,$
$\qquad x'_{n-1}, \ldots, x'_{h+1}, 1, x'_{h-1}, \ldots, x'_{j+1}, 0, x'_{j-1}, \ldots, x'_0)$.

Assume that the single product term line that is supposed to be selected by XX is $P_i$.
Since $x_h = 0$, there is a device $CA_{hi}$ at the crosspoint of the input line $a_h$ and the
product term line $P_i$. Assume that, due to the short, the value of both $a_h$ and $a_j$ is
forced to some value between logic 0 and logic 1 and that $CA_{hi}$ misinterprets line $a_h$ to
be at logic 1. Hence product term line $P_i$ is not selected by code input XX. In the
fault-free circuit, the code input

$YY = (x_{n-1}, \ldots, x_{h+1}, 0, x_{h-1}, \ldots, x_{j+1}, 0, x_{j-1}, \ldots, x_0,$
$\qquad x'_{n-1}, \ldots, x'_{h+1}, 1, x'_{h-1}, \ldots, x'_{j+1}, 1, x'_{j-1}, \ldots, x'_0)$

is supposed to select product term line $P_k$. Hence there is a device $CA_{jk}$ at the
crosspoint of input line $a_j$ and product term line $P_k$. Assume that, due to the short,
when the input is XX, $CA_{jk}$ misinterprets $a_j$ to be a logic 0 although it is supposed to
be at logic 1. In addition, we assume that $CA_{hi}$ and $CA_{jk}$ are the only AND array
crosspoint devices that misinterpret the values of $a_h$ and $a_j$ (in particular $CA_{hk}$
interprets $a_h$ correctly). Under these assumptions, all the input lines that are supposed
to be at logic 0 when the input is YY are interpreted as being at logic 0 by all the AND
array crosspoint devices connected to $P_k$ when the input is XX. Hence $P_k$ is selected
by input XX while $P_i$ is not selected by XX. Since no other crosspoint devices are
affected by the short, no other product term line except $P_k$ is selected by XX, and the
output is a code output. This short does not affect the output from the circuit for any
other code input since such input selects a product term other than $P_k$ or $P_i$. Hence,
the short is not detectable by any code input.

In the fault-free circuit, the noncode input

$W = (x_{n-1}, \ldots, x_{h+1}, 0, x_{h-1}, \ldots, x_{j+1}, 1, x_{j-1}, \ldots, x_0,$
$\qquad x'_{n-1}, \ldots, x'_{h+1}, 1, x'_{h-1}, \ldots, x'_{j+1}, 1, x'_{j-1}, \ldots, x'_0)$

does not select any product term and the output is noncode. However, due to the short
described above between $a_h$ and $a_j$, W selects $P_k$ and the result is a code output
from the circuit. Hence this short, that is not detectable by code inputs, masks a
noncode input.
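
The masking scenario can be traced on a concrete instance. The sketch below is an
illustration only: it takes n = 3, h = 2 and j = 1 as arbitrary choices, identifies each
product term by the bitmask of its a-literals, and models a misinterpreting crosspoint
device as an entry in a hypothetical `misreads` table giving the value that device sees.
None of these names come from the report.

```python
def selected_terms(a, b_prime, misreads=None):
    """Return the product terms (bitmasks Q) selected by the input (a, b').

    misreads maps (input index i, term q) to the value that the crosspoint
    device of a_i on term q perceives; all other devices read the true value.
    """
    misreads = misreads or {}
    n = len(a)
    hits = []
    for q in range(2 ** n):
        literals = []
        for i in range(n):
            if (q >> i) & 1:
                literals.append(misreads.get((i, q), a[i]))
            else:
                literals.append(b_prime[i])
        if not any(literals):          # NOR term: selected when all literals are 0
            hits.append(q)
    return hits

# Tuples are indexed (bit 0, bit 1, bit 2); the shorted lines are a_2 (h) and a_1 (j).
xx = ((1, 1, 0), (0, 0, 1))            # code input XX: x_h = 0, x_j = 1
w = ((1, 1, 0), (0, 1, 1))             # noncode input W: same a, but b'_j = 1
misreads = {(2, 0b100): 1,             # CA_hi sees the intermediate level on a_h as 1
            (1, 0b110): 0}             # CA_jk sees the intermediate level on a_j as 0

fault_free_xx = selected_terms(*xx)
faulty_xx = selected_terms(*xx, misreads)
fault_free_w = selected_terms(*w)
faulty_w = selected_terms(*w, misreads)
```

In this instance XX selects the term 0b100 without the fault and the term 0b110 with it,
so exactly one term is selected either way and the output stays a code output. W selects
no term without the fault but selects 0b110 with it, so the noncode input is masked.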

It can be shown that if the adjacent input lines are $a_h$ and $b'_j$, a short between
these lines may also be undetectable by code inputs and can mask noncode inputs [25].
Thus, in order to ensure that the comparator is self-testing, it is necessary to prevent
shorts between input lines that can force both lines to a value between logic 0 and logic 1
from occurring. This can be done by laying out the PLA so that the separation between
input lines is large enough that the probability of a short between them is negligible.
Alternatively, the circuits that drive the inputs of the PLA can be designed so that a
single pull-down device can overcome two pull-up devices so that a short between input
lines when they are supposed to be carrying different values will always result in both
being forced to logic 0. Unfortunately, these solutions lead to a larger PLA that is also
slower due to larger capacitances.

3) A Short Between an Input Line and a Product Term Line: Using arguments
similar  to  the  above,  it  can  be  shown  that  if  a  short  between  an  input  line  and  a  product 
term  line  is  allowed  to  force  both  of  them  to  a  value  between  logic  0  and  logic  1,  an 
undetectable  fault,  that  can  mask  noncode  inputs,  may  result.  Here  again,  one  way  to 
prevent  this  situation  is  to  guarantee  that  when  the  lines  are  supposed  to  be  at 
complementary  values  they  are  both  always  forced  to  logic  0.  This  can  be  done  using 
large  pull-down  devices  in  the  circuits  that  drive  the  inputs  of  the  PLA  and  using  large 
AND  array  crosspoint  devices.  A  single  AND  array  crosspoint  device  or  a  single  pull-down 
in  an  input  driver  must  be  able  to  overcome  both  the  pull-up  device  of  the  input  driver 
and the pull-up device of the product term line.

D. Layout Guidelines for Eliminating Undetectable Faults

In  the  previous  three  subsections  we  identified  several  possible  faults  that  are  not 
detectable  by  any  code  inputs.  All  of  these  faults  are  shorts  between  adjacent  or 
crossing  lines.  In  particular,  any  short  that  results  in  both  lines  being  forced  to  a  value 
between  logic  0  and  logic  1  when  they  are  supposed  to  be  carrying  complementary  values 
may  lead  to  an  undetectable  fault.  The  layout  guidelines  for  preventing  these  faults  from 
occurring  in  the  actual  circuit  are  summarized  below. 

(1)  Adjacent  product  term  lines  must  be  connected  to  OR  array  crosspoint  devices  that 
control  different  output  lines. 

(2)  The  AND  array  crosspoint  devices  must  be  large  enough  so  that  a  single  device  can 
pull  down  two  pull-ups  —  a  product  term  line  pull-up  and  an  output  line  pull-up  or 
two  product  term  line  pull-ups. 

(3) The circuits that drive the inputs of the PLA must be designed so that a single
pull-down device can overcome two pull-up devices.


(4)  The  separation  between  adjacent  input  lines  and  between  adjacent  product  term 
lines  should  be  larger  than  the  minimum  separation  required  by  the  design  rules. 
This can help reduce the probability of a short between adjacent lines.

VII. THE SELF-TESTING PROPERTY OF THE COMPARATOR

In the previous section it was shown that the proposed comparator is not
self-testing with respect to some of the possible faults, unless certain guidelines about the
layout of the circuit and the size of some of its devices are followed. In this section it is
shown that the circuit is self-testing with respect to all the other faults in the fault
model. It is assumed that some measures, such as those discussed in the previous
section, are taken so that the undetectable faults cannot occur. In particular, it is
assumed that if there is a short between two lines and the lines are supposed to be
carrying complementary values, the value of one of the lines is modified so that they
both carry the same logic value.

A.  A  Weak  0  and/or  Weak  1  Fault  on  a  Single  Input  Line 

1) A Weak 0 Fault: Assume that the input line with a weak 0 fault is $a_k$ for some
$k \in I_n$. By definition, there is at least one AND array crosspoint device, $CA_{ki}$, connected
to $a_k$ that always misinterprets a logic 0 on $a_k$ as a logic 1. Hence, the device $CA_{ki}$ is
always turned on. Thus, the product term line $P_i$ that is connected to $CA_{ki}$ can never
be selected. Therefore the code input that is supposed to select $P_i$ results in no
product term line being selected and the output is noncode (1,1). An identical argument
can be made regarding a weak 0 fault on a $b'_j$ ($j \in I_n$) input line.

In the presence of a weak 0 fault on one of the input lines, for every crosspoint
device which misinterprets the input line to be a logic 1 when it is supposed to be a
logic 0, the code input that selects the corresponding product term line in the fault-free
circuit results in a (1,1) output. Thus the number of code inputs that detect this fault
varies between 1 and $2^{n-1}$, depending on the number of affected crosspoint devices.

2) A Weak 1 Fault: Assume that the line with a weak 1 fault is $a_k$ for some $k \in I_n$.
By definition, there is at least one AND array crosspoint device, $CA_{ki}$, connected to $a_k$
that always misinterprets a logic 1 on $a_k$ as a logic 0. We denote the product term line
connected to that crosspoint device by $P_i$. In the fault-free circuit, $P_i$ is selected by
some code input $XX = (x_{n-1}, \ldots, x_0, x'_{n-1}, \ldots, x'_0)$. Since there is a device at the
crosspoint of $a_k$ and $P_i$, the literal $a_k$ is in the product term that corresponds to $P_i$.
Hence $x_k = 0$. Thus,

$XX = (x_{n-1}, \ldots, x_{k+1}, 0, x_{k-1}, \ldots, x_0, x'_{n-1}, \ldots, x'_{k+1}, 1, x'_{k-1}, \ldots, x'_0)$.

In the fault-free circuit, the code input

$YY = (x_{n-1}, \ldots, x_{k+1}, 1, x_{k-1}, \ldots, x_0, x'_{n-1}, \ldots, x'_{k+1}, 0, x'_{k-1}, \ldots, x'_0)$

selects some other product term line $P_j$. Since $CA_{ki}$ misinterprets a logic 1 on $a_k$ to
be a logic 0, code input YY selects $P_i$. Since there is no device at the crosspoint of $a_k$
and $P_j$, $P_j$ is independent of $a_k$. Thus YY also selects $P_j$ despite the fault.

Since the number of $a_i$ ($i \in I_n$) inputs that are at logic 0 in XX has a different
parity from the number of $a_i$ inputs that are at logic 0 in YY, $P_i$ and $P_j$ are
connected to different output lines (see Equation (2) Section III). Since in the faulty
circuit, the code word YY selects both $P_i$ and $P_j$, the output is (0,0). An identical
argument can be made regarding a weak 1 fault on a $b'_j$ ($j \in I_n$) input line.
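
Both input-line cases can be reproduced with a behavioural model in which a single
crosspoint device always perceives a fixed value. The sketch below is illustrative and
rests on assumptions that are not taken from the report: n = 3, k = 1, the affected term
is the one with Q = {1}, and odd-|Q| terms drive c0 while even-|Q| terms drive c1. The
names are hypothetical.

```python
def outputs(a, b_prime, device=None):
    """Return (c1, c0).

    device = (k, q, seen): the crosspoint device of input a_k on product
    term q always perceives the line as the value `seen`.
    """
    n = len(a)
    odd_selected = even_selected = False
    for q in range(2 ** n):
        literals = []
        for i in range(n):
            if (q >> i) & 1:
                value = a[i]
                if device is not None and device[:2] == (i, q):
                    value = device[2]
                literals.append(value)
            else:
                literals.append(b_prime[i])
        if not any(literals):              # NOR term selected
            if bin(q).count("1") % 2:
                odd_selected = True
            else:
                even_selected = True
    return (int(not even_selected), int(not odd_selected))

# Tuples are indexed (bit 0, bit 1, bit 2).
xx = ((1, 0, 1), (0, 1, 0))   # code input that selects the term Q = {1}
yy = ((1, 1, 1), (0, 0, 0))   # XX with bit k = 1 complemented in both halves

weak0_result = outputs(*xx, device=(1, 0b010, 1))   # device reads logic 0 as 1
weak1_result = outputs(*yy, device=(1, 0b010, 0))   # device reads logic 1 as 0
```

The weak 0 fault is exposed by XX, which no longer selects any term, and the weak 1 fault
is exposed by YY, which now selects two terms of opposite parity.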

B. A Short Between Two Adjacent Input Lines

As previously mentioned, we assume that appropriate layout guidelines are followed
so that a short between lines always forces both of the lines to the same logic value
rather than to a value between logic 0 and logic 1. Since the inputs to the comparator
are all the bits from one of the functional modules and the complements of all the bits
from the other module, no two input lines are supposed to have the same value for all
code inputs. If the two adjacent shorted lines are $a_k$ and $b'_k$ ($0 \le k \le n-1$), every code
input is transformed to a noncode input which, as previously shown, results in (0,0) or (1,1)
output. Any other two input lines are supposed to transfer different values for half of
the code inputs. For these code inputs, the short forces a change in value on one of the
lines. Since we assume that there are no other faults, this is equivalent to a noncode input
which, as previously shown, results in (0,0) or (1,1) output.

C.  A  Weak  0  or  Weak  1  Fault  on  a  Single  Product  Term  Line 

Each  product  term  line  is  connected  to  only  one  output  line.  Hence,  a  weak  0  fault 
on  a  product  term  line  is  simply  a  stuck-at-1  fault  and  a  weak  1  fault  is  a  stuck-at-0 
fault. 

1) A Weak 1 (s-a-0) Fault: If one of the product terms is s-a-0, for the code input
that is supposed to select that product term, all product terms are set to 0 and the
output is (1,1).

2) A Weak 0 (s-a-1) Fault: Assume that the product term line $P_i$, which corresponds
to set $Q = Q_1$ in Equation (3), is s-a-1. For any code input that selects a product term
corresponding to some $Q = Q_2 \subseteq I_n$, where the parities of $|Q_1|$ and $|Q_2|$ are different,
product terms connected to both output lines are selected, and the output is (0,0). Thus
half the code inputs will result in a (0,0) output due to the s-a-1 fault on $P_i$.
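
The number of code inputs that expose each stuck-at fault can be counted directly. The
sketch below is an illustration under the same modelling assumptions as the text (terms
identified by a subset Q, odd-|Q| terms on c0, even-|Q| terms on c1); the function names
and the choice n = 3 are hypothetical.

```python
from itertools import product

def outputs(a, stuck=None):
    """Return (c1, c0) for code input A = B = a; stuck = (q, value) pins term q."""
    n = len(a)
    b_prime = [1 - bit for bit in a]
    odd_selected = even_selected = False
    for q in range(2 ** n):
        value = int(not any(a[i] if (q >> i) & 1 else b_prime[i] for i in range(n)))
        if stuck is not None and stuck[0] == q:
            value = stuck[1]
        if value:
            if bin(q).count("1") % 2:
                odd_selected = True
            else:
                even_selected = True
    return (int(not even_selected), int(not odd_selected))

def detecting_inputs(n, stuck):
    """Code inputs whose output becomes noncode, (0,0) or (1,1), under the fault."""
    return [a for a in product((0, 1), repeat=n)
            if outputs(a, stuck) in {(0, 0), (1, 1)}]

stuck_at_0 = detecting_inputs(3, (0b011, 0))
stuck_at_1 = detecting_inputs(3, (0b011, 1))
```

For n = 3 the s-a-0 fault is detected by a single code input and the s-a-1 fault by
$2^{n-1} = 4$ code inputs, as the text states.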

D.  A  Short  Between  Two  Adjacent  Product  Term  Lines 

Since only one product term line is supposed to be selected by every code input, for
any pair of adjacent product term lines, $P_i$ and $P_j$, there is one code input that is
supposed to select $P_i$ but not $P_j$ and there is another code input that is supposed to
select $P_j$ but not $P_i$. We consider the three possible effects of the short when the lines
are supposed to carry complementary values:

(1) Both lines are always forced to logic 0: In this case, for the two code inputs that
correspond to the two product terms (i.e. that are supposed to select them), no product
term line will be set to 1 and the output will be (1,1).

(2) Both lines are always forced to logic 1: Since the two product term lines are
connected to different output lines, for the two code inputs that correspond to these
product term lines, the output will be (0,0).

(3) Both product term lines always assume the value of one of the lines: Assume that
the two lines are $P_i$ and $P_j$, and that $P_i$ always dominates. The code input YY, that
is supposed to select $P_j$, does not select it, since $P_j$ is pulled to logic 0 by $P_i$, which is
not selected by YY. Hence YY does not select any product term and the output is
(1,1). The code input XX, that selects $P_i$, also selects $P_j$, which is pulled to logic 1 by
$P_i$. Since adjacent product term lines are connected to different output lines, XX
results in (0,0) output.
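
The three cases can be compared side by side with a small model. The sketch is an
illustration, not the report's method: the mode names "low", "high" and "dominate" are
hypothetical labels for cases (1) to (3), terms are identified by the bitmask Q, and
odd-|Q| terms are assumed to drive c0 while even-|Q| terms drive c1.

```python
from itertools import product

def outputs(a, short=None):
    """Return (c1, c0) for code input A = B = a.

    short = (qi, qj, mode) with mode "low" (both forced to 0), "high" (both
    forced to 1) or "dominate" (term qj takes the value of term qi).
    """
    n = len(a)
    b_prime = [1 - bit for bit in a]
    term = {q: int(not any(a[i] if (q >> i) & 1 else b_prime[i] for i in range(n)))
            for q in range(2 ** n)}
    if short is not None:
        qi, qj, mode = short
        if term[qi] != term[qj]:
            if mode == "low":
                term[qi] = term[qj] = 0
            elif mode == "high":
                term[qi] = term[qj] = 1
            else:
                term[qj] = term[qi]
    odd = any(v for q, v in term.items() if bin(q).count("1") % 2 == 1)
    even = any(v for q, v in term.items() if bin(q).count("1") % 2 == 0)
    return (int(not even), int(not odd))

def detections(n, short):
    """Number of code inputs that produce a noncode output under the short."""
    return sum(outputs(a, short) in {(0, 0), (1, 1)}
               for a in product((0, 1), repeat=n))

# Terms 0b001 and 0b000 have opposite parity, so they drive different output lines.
counts = {mode: detections(3, (0b001, 0b000, mode))
          for mode in ("low", "high", "dominate")}
# Terms 0b011 and 0b000 share a parity; forcing both high is the case of Section VI.A.
same_line_high = detections(3, (0b011, 0b000, "high"))
```

With the adjacent lines on different output lines, each mode is detected by the two code
inputs that correspond to the shorted terms. When both terms drive the same output line
and the short forces them to logic 1, no code input detects it, which is why guideline (1)
of Section VI.D is needed.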

E.  A  Weak  0  or  Weak  1  Fault  on  a  Single  Output  Line 

The output lines do not fan out within the comparator circuit. Thus, we need only
consider the value on the output line at the point of interface with the "outside world."
Hence, a weak 0 fault on an output line is equivalent to a stuck-at-1 fault and a
weak 1 fault is equivalent to a stuck-at-0 fault.

Based on Equation (2), any code input where the number of $a_i$ ($i \in I_n$) bits that are
at logic 0 is odd, is supposed to result in the output $(c_1, c_0) = (1,0)$. Thus, in the faulty
circuit, if $c_0$ is s-a-1, the output is (1,1), and if $c_1$ is s-a-0, the output is (0,0). A similar
argument can be made for any code input where the number of $a_i$ bits that are at
logic 0 is even and the output is supposed to be $(c_1, c_0) = (0,1)$. Hence $2^{n-1}$ code inputs
will detect a s-a-1 on $c_0$ and a s-a-0 on $c_1$, while the other $2^{n-1}$ code inputs will detect
a s-a-0 on $c_0$ and a s-a-1 on $c_1$.
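
The split of the code inputs between the four output stuck-at faults can be tallied with
a few lines of Python. This is a sketch for illustration: it uses only the parity rule
stated above for the fault-free output, and the function names and n = 3 are hypothetical.

```python
from itertools import product

def fault_free_output(a):
    """(c1, c0) for code input A = B = a, using the parity rule from the text."""
    zeros = a.count(0)
    return (1, 0) if zeros % 2 == 1 else (0, 1)

def detecting_count(n, line, value):
    """Count code inputs that give a noncode output when c_line is stuck at value."""
    count = 0
    for a in product((0, 1), repeat=n):
        c1, c0 = fault_free_output(a)
        if line == 0:
            c0 = value
        else:
            c1 = value
        if c1 == c0:                  # (0,0) or (1,1) signals the fault
            count += 1
    return count

tally = {(line, value): detecting_count(3, line, value)
         for line in (0, 1) for value in (0, 1)}
```

Each of the four faults is detected by $2^{n-1} = 4$ of the 8 code inputs for n = 3.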

F.  A  Short  Between  Two  Adjacent  Output  Lines 

There  are  only  two  output  lines  that  are  supposed  to  carry  different  values  for  every 
code  input.  Hence,  a  short  will  result  in  (0,0)  or  (1,1)  output  for  every  code  input. 

G.  A  Short  Between  an  Input  Line  and  a  Crossing  Product  Term  Line 

Assume that the short is between input line a_k and product term line P_i. Let XX
denote the code input that selects P_i in the fault-free circuit. We must consider the
case where a_k is connected to a crosspoint device that is connected to P_i (CA_{ki}
exists) as well as the case where CA_{ki} does not exist.

If CA_{ki} exists, every one of the 2^{n-1} code inputs for which a_k is supposed to be at
logic 1, is supposed to result in P_i at logic 0. The code input XX is the only code input
for which a_k is supposed to be at logic 0 while P_i is supposed to be at logic 1. For the
rest of the 2^{n-1}-1 code inputs, both a_k and P_i are supposed to be at logic 0.

If CA_{ki} does not exist, the product term corresponding to P_i includes the literal
b_k. For the code inputs with b_k at logic 1, both a_k and P_i are supposed to be
at logic 0. For the code input XX, b_k is supposed to be at logic 0, and both a_k and
P_i are supposed to be at logic 1. For the rest of the 2^{n-1}-1 code inputs, b_k is
supposed to be at logic 0, a_k is supposed to be at logic 1, and P_i is supposed to be at
logic 0. Thus, if CA_{ki} does not exist, there is no code input for which a_k is supposed to
be at logic 0 and P_i is supposed to be at logic 1.

As  in  the  proof  for  a  short  between  product  term  lines,  we  consider  the  three 
possible  effects  of  the  short  when  ak  and  Pi  are  supposed  to  carry  complementary 
values: 

(1) Both lines are forced to logic 0: If CA_{ki} exists, for the code input XX, P_i is
supposed to be the only selected product term line, while a_k is supposed to be at
logic 0. We assume that, due to the short, P_i is forced to logic 0 by a_k. Hence, no
product term line is selected and the output is (1,1).

On the other hand, if CA_{ki} does not exist, the literal a_k is not included in the
product term that corresponds to P_i. Hence, the code input that selects P_i in the
fault-free circuit is of the form:

XX = (x_{n-1}, ..., x_{k+1}, 1, x_{k-1}, ..., x_0, x'_{n-1}, ..., x'_{k+1}, 0, x'_{k-1}, ..., x'_0).

Let YY be one of the 2^{n-1}-1 code inputs such that YY ≠ XX and YY is also of the
form

YY = (y_{n-1}, ..., y_{k+1}, 1, y_{k-1}, ..., y_0, y'_{n-1}, ..., y'_{k+1}, 0, y'_{k-1}, ..., y'_0).

In the fault-free circuit YY selects some product term line P_j. Since there is no device
at the crosspoint of a_k and P_j, P_j is independent of a_k so the short between a_k and
P_i cannot affect P_j. Thus P_j is selected by YY despite the fault. In the fault-free
circuit, the code input

ZZ = (y_{n-1}, ..., y_{k+1}, 0, y_{k-1}, ..., y_0, y'_{n-1}, ..., y'_{k+1}, 1, y'_{k-1}, ..., y'_0)

selects some product term P_l. Since there is no device at the crosspoint of b_k and
P_l, P_l is independent of b_k. For the code input YY, a_k is supposed to be at logic 1
and P_i at logic 0. Due to the fault, P_i forces a_k to logic 0. Therefore, YY selects P_l
as well as P_j. Since the number of a_i (i ∈ I_n) inputs that are at logic 0 in YY has a
different parity from the number of a_i inputs that are at logic 0 in ZZ, P_j and P_l
are connected to different output lines (see Equation (2) Section III). Hence, for the code
input YY the output is (0,0).

(2) Both lines are forced to logic 1: Let YY be one of the 2^{n-2} code inputs for
which a_k is supposed to be at logic 1 and the number of a_i inputs that are at logic 0 in
YY has a different parity from the number of a_i inputs that are at logic 0 in XX. Due
to the short, when the input is YY, a_k forces P_i (that is supposed to be at logic 0) to
logic 1. In addition, as in the fault-free circuit, YY selects another product term that
controls a different output line from P_i. Hence the output from the circuit is (0,0).

(3) Both lines are always forced to the value of a_k or they are always forced to
the value of P_i:

(a) Line a_k always dominates: The proof is identical to case (2) above.

(b) Line P_i always dominates: There are at least 2^{n-1}-1 code inputs of the form

YY = (y_{n-1}, ..., y_{k+1}, 1, y_{k-1}, ..., y_0, y'_{n-1}, ..., y'_{k+1}, 0, y'_{k-1}, ..., y'_0)

that do not select P_i in the fault-free circuit. In the faulty circuit, if P_i always
"dominates," YY selects two product term lines that are connected to different output
lines. One is the product term line selected by YY in the fault-free circuit and the other
is the product term line selected by

ZZ = (y_{n-1}, ..., y_{k+1}, 0, y_{k-1}, ..., y_0, y'_{n-1}, ..., y'_{k+1}, 1, y'_{k-1}, ..., y'_0)

in the fault-free circuit. Hence, the output is (0,0).

H.  A  Short  Between  a  Product  Term  Line  and  a  Crossing  Output  Line 

Assume that the short is between product term line P_i and output line c_m, where
m ∈ {0,1}. Let m' denote 0 when m is 1 and denote 1 when m is 0. Let XX denote
the code input that selects P_i in the fault-free circuit.

As in the proof for a short between product term lines, we consider the three
possible effects of the short when P_i and c_m are supposed to carry complementary
values:

(1) Both lines are forced to logic 0: In the fault-free circuit there are at least
2^{n-1}-1 code inputs that do not select P_i and for which the output is (c_m, c_{m'}) = (1,0).
For any one of these inputs, due to the short, P_i forces c_m to logic 0 and the output is
(0,0).

(2) Both lines are forced to logic 1: If there is a device at the crosspoint of P_i and
c_m (CO_{im} exists), in the fault-free circuit, for the code input XX that selects P_i, the
output is (c_m, c_{m'}) = (0,1). In the faulty circuit, due to the short, c_m is forced to
logic 1. Since none of the product term lines are affected, the output is (1,1). If CO_{im}
does not exist, then, as discussed in Subsection B of Section VI, the fault cannot be
detected by any code input.

(3) Both lines are always forced to the value of P_i or they are always forced to
the value of c_m:

(a) If the value of P_i always "dominates," the proof is identical to case (1) above.

(b) If the value of c_m always "dominates," then, as discussed in Subsection C of
Section VI, the fault cannot be detected by any code input.

I. An Extra Crosspoint Device in the AND Array

In the fault-free circuit, every product term line, P_i, is connected to n crosspoint
devices in the AND array. For every code input, n of the input lines are at logic 0 and
n are at logic 1. If, due to a fault, there are n+1 crosspoint devices connected to P_i,
every code input turns on at least one of these devices and sets P_i to logic 0. Thus, the
single code input that selects P_i in the fault-free circuit does not select P_i in the
faulty circuit. Hence, for that input, no product term line is selected, and the output is
(1,1).

J. An Extra Crosspoint Device in the OR Array

An extra crosspoint device in the OR array means that there is one product term
line, P_i, that is connected to both output lines. Hence, for the single code input that
selects P_i, the output is (0,0).
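
The two extra-device cases can be reproduced with a small behavioral model. The Python sketch below is an illustration under stated assumptions, not the circuit of Fig. 1: it assumes that the product term for code word x has a crosspoint device on a_i when x_i = 0 and on b_i when x_i = 1, and that terms with an odd number of a_i bits at logic 0 pull down c_0 while the others pull down c_1 (the parity rule attributed to Equation (2)). The names build_comparator and evaluate are hypothetical.

```python
from itertools import product


def build_comparator(n):
    """Return (and_array, or_array) for a 2**n-term NOR-NOR comparator model.

    and_array[p] is the set of input lines with a device on product term p;
    lines 0..n-1 stand for a_0..a_{n-1}, lines n..2n-1 for b_0..b_{n-1}.
    or_array[p] is the set of output lines ('c0', 'c1') that term p pulls down.
    """
    and_array, or_array = [], []
    for x in product((0, 1), repeat=n):
        # Connect the term to the n lines that are at logic 0 for code word x.
        and_array.append({i if x[i] == 0 else n + i for i in range(n)})
        or_array.append({"c0"} if x.count(0) % 2 == 1 else {"c1"})
    return and_array, or_array


def evaluate(and_array, or_array, a_bits):
    """Apply the code input (a, complement of a) and return (c1, c0)."""
    inputs = list(a_bits) + [1 - bit for bit in a_bits]
    # A NOR product term is selected only if every connected line is at logic 0.
    selected = [p for p, lines in enumerate(and_array)
                if all(inputs[line] == 0 for line in lines)]
    c0 = 0 if any("c0" in or_array[p] for p in selected) else 1
    c1 = 0 if any("c1" in or_array[p] for p in selected) else 1
    return (c1, c0)


if __name__ == "__main__":
    n = 3
    target = (0, 1, 1)  # code word whose product term carries the fault
    and_array, or_array = build_comparator(n)
    p = list(product((0, 1), repeat=n)).index(target)

    extra_and = [set(lines) for lines in and_array]
    extra_and[p].add(n + 0)  # extra AND device on b_0 (a_0 is already connected)
    print(evaluate(extra_and, or_array, target))  # (1, 1): no term selected

    extra_or = [set(lines) for lines in or_array]
    extra_or[p] = {"c0", "c1"}  # extra OR device: term drives both outputs
    print(evaluate(and_array, extra_or, target))  # (0, 0)
```

Only the code input that selects the faulty term exposes either fault in this model, which is consistent with the observation that all 2^n code inputs are needed for a complete self-test.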

K.  A  Break  in  a  Product  Term  Line 

Each  product  term  line  controls  one  OR  array  crosspoint  device  and  is  controlled  by 
n  AND  array  pull-down  devices  and  one  pull-up  (or  precharge)  device.  All  the  pull-down 
devices  are  connected  to  the  "middle"  of  the  line.  The  pull-up  device  and  the  OR  array 
crosspoint  device  are  either  connected  on  opposite  ends  of  the  product  term  line  (as 
shown  in  Fig.  1)  or  on  the  same  end  of  the  line. 

If  the  product  term  line  pull-up  device  and  the  OR  array  crosspoint  device  are  on 
opposite  ends  of  the  line,  any  break  in  the  product  term  line  prevents  the  segment  of  the 
line  connected  to  the  OR  array  device  from  being  pulled  up.  As  a  result,  the  product 
term  line  is  either  floating  or  stuck-at-0.  If  the  line  is  floating,  its  value  is  constant  and 
independent  of  the  input.  Hence,  in  any  case,  the  product  term  line  segment  that 
controls  the  output  line  is  either  stuck-at-0  or  stuck-at-1.  Earlier  in  this  section  it  is 
shown  that  a  stuck-at  fault  on  a  product  term  line  is  detectable  by  some  code  input. 

If the product term line pull-up device and the OR array crosspoint device are on
the same end of the line, a break in the product term line disconnects some of the AND
array crosspoint devices from the segment of the line connected to the OR array device.
As a result, the product term line is selected when it is not supposed to be selected. Let
P_i denote the product term line that is selected by the code input

XX = (x_{n-1}, ..., x_h, ..., x_0, x'_{n-1}, ..., x'_h, ..., x'_0)

in the fault-free circuit. A break in
P_i disconnects some AND array crosspoint device, CA_{hi}, from the segment of P_i that
controls the OR array device. Since CA_{hi} is controlled by input line a_h, in the fault-free
circuit, P_i can only be selected if a_h = 0. Hence, x_h = 0. In the fault-free circuit,
the code input

YY = (x_{n-1}, ..., x'_h, ..., x_0, x'_{n-1}, ..., x_h, ..., x'_0)

selects the product
term P_j where P_i and P_j are connected to different output lines. Since the
crosspoint device CA_{hi} is disconnected from P_i in the faulty circuit, P_i is not affected
by a_h and the code input YY selects both P_i and P_j. Hence, the output is (0,0).
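
The effect of the disconnected device can be traced on a small example. The Python sketch below is a behavioral illustration only; it assumes that the term for code word x has a device on a_i when x_i = 0 and on b_i when x_i = 1, and that terms with an odd number of zeros pull down c_0 while the others pull down c_1. The helper names and the `disconnected` parameter are hypothetical.

```python
from itertools import product


def selected_terms(a_bits, disconnected=None):
    """Return the code words whose product terms are selected by input a_bits.

    `disconnected` is an optional (term, i) pair naming the one AND array
    device (on bit position i of that term) cut off by a break in the line.
    """
    n = len(a_bits)
    chosen = []
    for x in product((0, 1), repeat=n):
        pulled_down = False
        for i in range(n):
            if disconnected == (x, i):
                continue  # this device no longer reaches the OR array segment
            line_value = a_bits[i] if x[i] == 0 else 1 - a_bits[i]
            if line_value == 1:
                pulled_down = True
        if not pulled_down:
            chosen.append(x)
    return chosen


def output(terms):
    """Return (c1, c0) given the selected terms, using the assumed parity wiring."""
    c0 = 0 if any(x.count(0) % 2 == 1 for x in terms) else 1
    c1 = 0 if any(x.count(0) % 2 == 0 for x in terms) else 1
    return (c1, c0)


if __name__ == "__main__":
    xx = (0, 1, 0)  # selects P_i; x_h = 0 with h = 0
    yy = (1, 1, 0)  # XX with bit h complemented
    print(output(selected_terms(yy)))                         # fault-free: (1, 0)
    print(output(selected_terms(yy, disconnected=(xx, 0))))   # with the break: (0, 0)
```

With the device removed, YY selects both its own term and the term for XX, and since the two code words differ in exactly one bit their terms drive different output lines.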


L.  A  Break  in  an  Output  Line 

Each output line is controlled by 2^{n-1} OR array pull-down devices and one pull-up
(or precharge) device. All the pull-down devices are connected to the "middle" of the
line.  The  pull-up  device  and  the  output  from  the  circuit  are  either  on  opposite  ends  of 
the  output  line  (as  shown  in  Fig.  1)  or  on  the  same  end  of  the  line. 

If  the  output  line  pull-up  device  and  the  circuit  output  are  on  opposite  ends  of  the 
line,  any  break  in  the  output  line  prevents  the  segment  of  the  line  that  serves  as  the 
output  from  the  circuit  from  being  pulled  up.  As  a  result  the  output  line  is  either 
floating or stuck-at-0. If the line is floating, its value is constant and independent of the
input.  Hence,  in  any  case,  the  segment  of  the  line  that  serves  as  the  circuit  output  is 
either  stuck-at-0  or  stuck-at-1.  Earlier  in  this  section  it  is  shown  that  a  stuck-at  fault 
on  an  output  line  is  detectable  by  some  code  input. 

If  the  output  line  pull-up  device  and  the  circuit  output  are  on  the  same  end  of  the 
line,  a  break  in  the  output  line  disconnects  some  of  the  OR  array  crosspoint  devices  from 
the  segment  of  the  line  that  is  the  output  from  the  circuit.  As  a  result,  the  output  line  is 
selected when it is not supposed to be selected. Let c_m denote the output line with a
break. Let CO_{im} denote an OR array crosspoint device that is disconnected from the
segment of c_m that serves as the circuit output. In the fault-free circuit, the product
term line P_i, that controls CO_{im}, is selected by the code input XX. In the faulty
circuit, due to the break, the crosspoint device CO_{im} cannot affect the output line c_m.
For the code input XX, P_i is the only selected product term. Hence CO_{im} is the only
OR array crosspoint device that is turned on. Therefore neither output line is pulled
down and the circuit produces the noncode output (1,1).

VIII. IMPLEMENTATION AND APPLICATION CONSIDERATIONS

In  the  previous  three  sections  it  was  shown  that,  using  a  single  two-level  NOR-NOR 
PLA,  it  is  possible  to  implement  a  comparator  that  is  self-testing  with  respect  to  any 
single  fault  that  is  likely  to  occur  in  MOS  VLSI  circuits.  This  result  is  a  necessary 
prerequisite  for  the  use  of  duplication  and  matching  as  the  basic  scheme  for 
implementing  error  detection.  However,  two  main  problems  remain  to  be  discussed: 
(1)  The  size  of  the  comparator,  implemented  as  a  single  PLA,  grows  exponentially  with  the 
number  of  bits  in  the  two  vectors  to  be  compared.  (2)  It  is  necessary  to  ensure  that  all 
the code inputs will appear as inputs to the comparator often enough so that a complete
self-test of the comparator will be performed before there is a chance for multiple faults
to occur in the system.


In  Section  IV  it  was  shown  that  a  self-testing  comparator  implemented  as  a  single 
two-level NOR-NOR PLA must have 2^n product term lines. If the output from each one
of the duplicated functional modules is, say, 16 bits, this implementation is impractical
since it requires 2^16 = 65536 product terms. Fortunately, efficient implementations of a
self-testing two-rail code checker (comparator) for large input vectors can be achieved
by using checkers for smaller input vectors as cells that are connected together in a tree
structure (Fig. 3) [15, 28].

Fig. 3: A Self-Testing Two-Rail Code Checker Tree


Each cell is a self-testing comparator for relatively small bit vectors (two to six bits
wide) which is implemented with a single two-level NOR-NOR PLA, as outlined in Section III.
A complete tree with h levels of cells, where each cell is an m-bit comparator, can be
used to compare m^h bits and contains (m^h - 1)/(m - 1) cells. Hence, if the vectors to
be compared are n bits wide, the number of levels in the tree is ⌈log_m n⌉ while the total
number of cells in the tree is at most (n-1)/(m-1). Thus the number of cells is
(approximately) linearly related to n. Hence, tree-structured cellular implementations
of self-testing comparators are practical for large input bit vectors.
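
The size comparison can be made concrete with a few lines of arithmetic. The Python sketch below is a hedged illustration: the function names are hypothetical, the tree is assumed to be complete, and the product term count for the tree simply multiplies the number of cells by the 2^m terms of one m-bit cell.

```python
def tree_levels(n, m):
    """Smallest h with m**h >= n, i.e. ceil(log_m n), using integer arithmetic."""
    levels, capacity = 0, 1
    while capacity < n:
        capacity *= m
        levels += 1
    return levels


def complete_tree_cells(m, h):
    """Cells in a complete tree with h levels of m-bit cells: (m**h - 1)/(m - 1)."""
    return (m ** h - 1) // (m - 1)


def product_terms_single_pla(n):
    """Product term lines needed by a single-PLA n-bit comparator."""
    return 2 ** n


def product_terms_tree(n, m):
    """Estimate for a complete tree: every cell is an m-bit PLA with 2**m terms."""
    return complete_tree_cells(m, tree_levels(n, m)) * 2 ** m


if __name__ == "__main__":
    n = 16
    print("single PLA:", product_terms_single_pla(n))  # 65536
    for m in (2, 4):
        h = tree_levels(n, m)
        print("m =", m, "levels =", h, "cells =", complete_tree_cells(m, h),
              "product terms =", product_terms_tree(n, m))
```

For 16-bit vectors and 2-bit cells this gives 4 levels and 15 cells, which matches (n-1)/(m-1) for n = 16 and m = 2.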

In  the  cellular  tree-structured  implementation  of  the  comparator,  a  noncode  output 
from  any  one  of  the  cells  presents  a  noncode  input  to  the  cells  at  the  next  level.  This 
forces  the  output  from  the  entire  tree  to  be  noncode.  Hence,  the  tree-structured 
implementation  preserves  the  self-testing  property  of  the  cells. 

If  duplication  and  matching  is  used  for  error  detection,  the  first  fault  that  occurs  in 
the comparator must be detected before additional faults can occur in the comparator
or  in  the  functional  modules.  Thus,  a  set  of  code  words  that  achieves  a  complete  self-test 
of  the  comparator  must  appear  as  inputs  to  the  comparator  within  a  time  interval  that  is 
significantly  smaller  than  the  mean  time  between  failures  for  the  two  functional  modules 
and  the  comparator  together.  Based  on  the  results  of  sections  IV  and  VII,  a  complete 
self-test of a comparator implemented as a single NOR-NOR PLA requires all 2^n code
words  to  appear  at  the  inputs.  If  n  is  large,  this  requirement  may  imply  that  the 
complete  self-test  takes  so  much  time  that  there  is  an  unacceptably  high  probability 
that  additional  faults  may  occur  in  the  comparator  or  functional  modules  before  the 
self-test  is  completed.  Fortunately,  for  the  tree-structured  cellular  implementation,  the 
number of code inputs required for a complete self-test is only 2^m, where m is the size
of  the  bit  vectors  compared  by  each  cell  [5, 15].  Thus,  if  the  cells  are  2-bit  comparators, 
four  code  inputs  are  sufficient  for  a  complete  self-test  of  the  entire  tree. 

Even  with  the  relatively  small  number  of  code  inputs  needed  for  a  complete  self-test, 
it  may  still  be  difficult  to  satisfy  the  requirement  that  certain  code  words  appear  as 
inputs  to  the  comparator  with  some  specified  frequency,  as  implied  by  the  assumptions 
in  Section  III.  We  assume  that  the  system  consists  of  subsystems  that  interact  with  each 
other. Each subsystem is implemented with duplicate functional modules and a self-testing
comparator so that the failure of a particular subsystem is detected by the other
subsystems by simply observing the output from the corresponding comparator [24]. The
comparison of the outputs of the two modules that make up each subsystem is performed
at the interface between the subsystem and the rest of the system.

If  the  "subsystems"  are  low-level  passive  circuits  such  as  an  ALU  or  an  instruction 
decoder  within  a  processor,  it  may  not  be  possible  to  ensure  that  the  necessary  outputs 
will  be  generated  by  the  modules.  Hence,  duplication  and  matching  is  inappropriate  at 
this level.

Duplication  and  matching  is  an  attractive  scheme  for  implementing  error  detection 
if  the  subsystems  are  high-level,  "intelligent"  modules  that  interact  with  similar  modules. 
Examples  of  such  high-level  subsystems  are  the  computation  nodes  or  the 
communication nodes in a multicomputer system [24]. In this case, the subsystem may
periodically initiate action that causes it to generate all the necessary patterns at its
interface with the other subsystems. The subsystem initiating the self-test of its
comparator can inform the other subsystems that the next "message" is simply a test
and should not be interpreted as "real work."

IX. SUMMARY AND CONCLUSIONS

We  have  presented  a  new  fault  model  for  MOS  PLAs.  This  model  incorporates  several 
types  of  faults  that  are  likely  to  occur  in  VLSI  circuits  but  are  not  taken  into  account  in 
previously  published  models.  Using  this  more  realistic  model,  it  was  shown  that  the 
widely  accepted  design  of  a  "self-testing"  comparator  implemented  as  a  NOR-NOR  PLA 
results  in  a  comparator  that  is  self-testing  with  respect  to  any  single  fault  only  if  certain 
guidelines  regarding  the  physical  layout  of  the  circuit  are  followed.  Following  these 
guidelines  requires  additional  silicon  area  and  reduces  performance. 

The  use  of  duplication  and  matching  of  high-level  modules  for  error  detection 
appears  especially  attractive  in  view  of  the  difficulties  in  analyzing  and  verifying  the 
self-testing  properties  of  the  comparator.  Despite  the  simple  structure  of  the  proposed 
self-testing  comparator,  the  analysis  and  verification  of  its  self-testing  properties  are 
surprisingly  lengthy  and  complex.  It  is  therefore  doubtful  that  the  effects  of  faults  on 
large  VLSI  circuits,  such  as  microprocessors,  can  be  predicted  reliably.  Furthermore, 
while restrictions on the layout of the comparator are acceptable in order to enhance its
testability,  such  restrictions  cannot  be  tolerated  in  the  layout  of  large  chips,  where  one 
of  the  major  goals  is  to  implement  as  much  functionality  as  possible  per  unit  area. 

Since  the  area  taken  up  by  the  comparator  may  be  of  concern,  it  was  shown  that  the 
proposed  comparator  is  optimal  with  respect  to  the  area  it  occupies.  Specifically,  it  is 
shown  that  if  a  single  two-level  NOR-NOR  PLA  is  used  to  implement  a  self-testing 
comparator of two n-bit vectors, an optimal design must include 2n input lines, 2^n
product term lines and 2 output lines. Furthermore, 2^n code inputs are necessary for
a  complete  self-test  of  any  such  circuit. 

The  effectiveness  of  self-testing  comparators  as  critical  elements  in  duplication  and 
matching  schemes  for  error  detection  is  dependent  on  the  system  within  which  they  are 
employed.  If  the  duplicated  functional  modules  are  simple,  low-level,  passive  circuits,  it 
may  not  be  possible  to  ensure  that  the  comparator  will  go  through  a  complete  self-test 
often  enough  and  the  scheme  may  eventually  fail  due  to  an  undetected  fault  in  the 
comparator.  However,  if  the  duplicated  modules  are  high-level  subsystems,  the  use  of 
self-testing  comparators  in  a  duplication  and  matching  scheme  is  an  effective  way  to 
implement error detection in VLSI systems.
Abstract

This  paper  presents  fast,  simple,  and  relatively  accurate  delay  models  for  large 
digital  MOS  circuits.  Delay  modelling  is  organized  around  chains  of  switches 
and  nodes  called  stages,  instead  of  logic  gates.  The  use  of  stages  permits  both 
logic  gates  and  pass  transistor  arrays  to  be  handled  in  a  uniform  fashion. 
Three  delay  models  are  presented,  ranging  from  an  RC  model  that  typically 
errs  by  25%  to  a  slope-based  model  whose  delay  estimates  are  typically  within 
10%  of  SPICE’s  estimates.  The  slope  model  is  parameterized  in  terms  of  the 
ratio  between  the  slopes  of  a  stage’s  input  and  output  waveforms.  All  the 
models  have  been  implemented  in  the  Crystal  timing  analyzer.  They  are 
evaluated  by  comparing  their  delay  estimates  to  SPICE,  using  a  dozen  critical 
paths  from  two  VLSI  designs. 
1. Introduction

Over  the  last  twenty  years,  a  great  deal  of  effort  has  been  spent  in  the 
development  of  transistor  models  for  use  in  simulation  programs.  The  primary 
concern  in  those  models  has  been  accuracy,  the  ability  to  simulate  exactly  the 
real-world  behavior  of  devices.  The  models  have  achieved  high  accuracy  by 
modelling  circuits  with  systems  of  differential  equations.  The  success  of  the 
models  can  be  seen  in  the  popularity  of  circuit  simulation  programs  such  as 
SPICE  [3]. 

Unfortunately,  the  accuracy  of  the  circuit  models  comes  at  a  high  price  in 
execution  time:  circuit  simulators  typically  require  several  seconds  of  CPU 
time  per  transistor.  As  a  result,  the  programs  are  impractical  for  today's 
state-of-the-art  VLSI  circuits,  which  contain  tens  or  hundreds  of  thousands  of 
transistors.  Although  there  have  been  recent  improvements  in  the  speed  of 
circuit  simulators  [2],  they  still  require  too  much  time  for  VLSI  circuits. 

This  paper  describes  a  different  approach  to  transistor  modelling,  where 
speed is the primary consideration. The models treat each transistor as a perfect
switch in series with a resistor. Instead of solving differential equations,
the  switch-level  models  use  tables  to  compute  the  value  of  the  series  resistance. 
The  switch-level  approach  results  in  four  orders  of  magnitude  improvement  in 
speed:  only  a  few  hundred  microseconds  of  execution  time  are  needed  per 
transistor.  In  spite  of  their  simplicity,  the  switch-level  models  provide  delay 
estimates for digital MOS circuits that are typically within 10% of what
SPICE  would  estimate  for  the  same  circuits. 

The  switch-level  models  achieve  their  accuracy  and  speed  by  capitalizing 
on  the  uniform  design  style  used  in  large  circuits.  For  example,  digital  VLSI 
circuits  tend  to  have  only  a  few  different  sizes  of  transistor  and  a  few  pullup- 
pulldown  ratios,  used  over  and  over.  Since  the  pieces  of  the  circuit  have  about 
the  same  structure,  they  also  have  about  the  same  delay  properties.  The  delay 
properties  of  the  basic  constructs  can  be  measured  by  running  SPICE  on  small 
examples  and  distilling  the  results  down  to  a  few  tables.  When  analyzing  large 
circuits,  delay  estimates  are  computed  quickly  using  the  tables.  If  there  isn’t 
much  variation  in  the  structures  used  in  the  circuit,  small  tables  will  produce 
accurate results. The approximate models tend not to work as well for sensitive
analog components or circuits with large variation in design style.

This  paper  describes  and  evaluates  three  switch-level  delay  models  that 
have  been  implemented  in  Crystal,  a  program  that  locates  critical  timing  paths 
in  VLSI  circuits  [5].  Crystal’s  models  contain  two  features  that  permit  fast 
and  accurate  delay  estimates.  First,  circuits  are  decomposed  into  chains  of 
transistors called stages. Each stage is independent for purposes of delay
calculation. The stage decomposition permits Crystal to handle both logic gates
and pass transistors in a uniform fashion. Second, each transistor is modelled
by an effective resistance whose value depends on the shape of the transistor,
the waveform driving its gate, and the load being driven by the transistor.
For any given transistor, all of these factors are combined together into a single
ratio that determines the effective resistance. The ratio approach means
that  small  tables  can  be  used  to  handle  a  large  variety  of  actual  situations. 

Section 2 describes the stage decomposition and the advantages it provides
over a logic-gate decomposition. Sections 3-5 present the three delay
models.  The  lumped  resistor-capacitor  model  of  Section  3  is  the  simplest  and 
fastest,  but  is  only  accurate  to  within  about  25%.  Section  4  describes  the 
more  complex  slope  model,  which  is  usually  accurate  to  within  10%.  Section  5 
applies  the  Penfield-Rubinstein  models  for  distributed  capacitance  [7,9]  to  get 
still  greater  accuracy.  Section  6  discusses  the  limitations  of  the  models,  and 
Section  7  compares  this  work  to  previous  work  in  the  area. 

2.  Stages 

At  any  given  time,  Crystal’s  delay  modeller  considers  a  collection  of 
transistors  called  a  stage.  A  stage  consists  of  a  chain  of  nodes  and  transistor 
channels  forming  an  electrical  path  from  a  strong  signal  source  (such  as  Vdd, 
Ground,  or  a  highly  capacitive  bus)  to  some  other  node,  called  the  output  of 
the  stage.  As  shown  in  Figure  1,  a  single  transistor  may  be  associated  with 
different  stages  during  different  phases  of  the  analysis.  Stages  generally 
correspond  to  logic  gates,  except  that  pass  transistors  are  lumped  together 
with the logic gates that drive them. See [5] for information on how stages are
selected  in  Crystal. 


The  delay  modeller  is  given  the  sizes  and  types  of  the  transistors  in  the 
stage,  and  the  parasitic  resistances  and  capacitances  of  the  nodes  along  the 
stage. One of the transistors is identified as the trigger; it is assumed to be
the last transistor to turn on in the stage. Its gate is called the input of the
stage, since it controls the activation of the stage. The modeller is also given
information about the waveform at the input. Its job is to use this information
to compute the waveform at the output of the stage, assuming that all the
transistors in the stage except the trigger are turned on.

Waveforms are described at different levels of precision in different delay
models. In the simplest delay model, each waveform is described by a single
value: the time at which its voltage crosses the logic threshold (the input voltage
for a standard inverter where the input and output voltages are equal in
the dc transfer curve). This is called the inversion time for the waveform. In
the slope-based models each waveform is also characterized by its slope (in
ns/volt) at the logic threshold voltage. This is called the rise-time of the
waveform.

Figure 1. A stage is a chain of transistors analyzed together for delay calculations.
During different phases of analysis, the stages in (b) and (c) might be extracted from
the circuit in (a) for delay calculation. The trigger transistor is the last one in the
stage to turn on.

Figure 2. In computing delays, information from side paths is not included. For
example, in this figure the capacitance at A and B is not included. However, the
gate-source overlap capacitance from the side transistors (C1 and C2) is included in
the parasitic capacitance of the stage.


Crystal’s  delay  modeller  only  considers  information  on  the  direct  path 
between  signal  source  and  output.  All  side  transistors  connecting  to  the  path 
are  assumed  to  be  turned  off:  their  gate-source  capacitance  is  included  in  the 
parasitic  capacitance,  but  information  on  the  far  side  of  side  transistors  is 
ignored  (see  Figure  2).  This  approach  is  used  in  Crystal  because  the  program 
does  not  have  specific  information  about  whether  side  transistors  are  turned  on 
or  off;  if  it  automatically  included  all  side  capacitance,  its  delay  estimates 
would  be  unrealistically  high.  Busses  and  pass  transistor  arrays  account  for 
most  of  the  situations  with  many  side  paths,  and  in  these  cases  only  a  single 
path is usually active through the structure at once. (If it becomes necessary
to include side paths, the results of Section 5 can be extended to handle them.)
Most  stages  in  VLSI  circuits  are  simple:  in  the  critical  paths  found  by  Crystal, 
80-90%  of  all  stages  contain  only  a  single  transistor. 

Using  stages  as  the  basis  for  delay  computation  has  worked  out  well, 
because  both  normal  gates  and  pass  transistors  can  be  handled  in  a  uniform 
fashion.  Most  previous  timing  analyzers  have  been  organized  around  logic 
gates.  They  have  generally  had  difficulty  dealing  with  pass  transistors  because 
the  non-linear  pass  transistor  effects  cannot  be  separated  cleanly  from  the  rest 
of  the  gate.  The  stage  approach  handles  gates  with  or  without  pass  transistor 
structures uniformly as chains of switches. It also accommodates complex gates
such  as  AND-OR-INVERT. 

3.  The  RC  Model 

The  RC  model  is  the  simplest  and  least  accurate  of  Crystal’s  models.  It 
computes  a  resistance  and  capacitance  value  for  each  node  and  transistor 
along the stage. The resistances and capacitances are summed, and the product
of these two lumped values is used as the delay time for the stage. No
information  about  waveforms  is  used  by  the  RC  model:  the  delay  time  for  the 
stage  is  added  to  the  inversion  time  at  the  input  to  calculate  the  inversion 
time  at  the  output. 
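
The lumped calculation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Element` class, the function names, and the numeric values are hypothetical and are not taken from Crystal's source code or from Table I.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One node or transistor along a stage (hypothetical representation)."""
    resistance_ohms: float
    capacitance_farads: float


def rc_stage_delay(elements):
    """Lumped RC delay in seconds: (sum of all R) times (sum of all C)."""
    total_r = sum(e.resistance_ohms for e in elements)
    total_c = sum(e.capacitance_farads for e in elements)
    return total_r * total_c


def output_inversion_time(input_inversion_time, elements):
    """The RC model ignores waveform shape: it only adds the stage delay."""
    return input_inversion_time + rc_stage_delay(elements)


# Illustrative values only, not taken from Table I.
stage = [Element(10_000.0, 0.05e-12), Element(500.0, 0.10e-12)]
delay = rc_stage_delay(stage)
```

With these made-up values the stage has 10.5 kilohms and 0.15 pF in total, so the lumped delay is about 1.6 ns.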

Transistor Type                          Ohms/square         Ohms/square
                                         (transmitting 1)    (transmitting 0)

Enhancement
Enhancement driven by pass transistor
Depletion load
Super-buffer (depletion, gate 1)
Depletion (gate 0 or unknown)


30000 


22000 


50000 


15000 


48000 


Table I. The effective resistances used by the RC model for nMOS circuits. The
missing entries are for situations that do not ever occur (for example, depletion loads
are used only to transmit 1's).

The  effective  resistance  of  each  transistor  in  the  stage  is  computed  using  a 
table  based  on  the  transistor’s  type  (enhancement,  depletion,  etc.)  and 
geometry.  For  each  type  of  transistor  there  are  two  effective  resistance  values, 
each expressed in ohms per square. The first resistance value is used if the
transistor is transmitting a logic one and the second is used if the transistor
is transmitting a logic zero. The values currently used in Crystal are given in
Table  I.  The  effective  resistance  of  a  transistor  is  computed  by  selecting  the 
appropriate  parameter  value  and  multiplying  it  by  the  length/width  ratio  of 
the  device. 
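
As a rough sketch of this lookup, the fragment below selects a per-square value by transistor type and transmitted logic level, then scales it by length/width. The table contents, the type name `example_type`, and the function name are assumptions made for illustration; they are not Crystal's actual parameters.

```python
# Hypothetical parameter table: type -> (ohms/square transmitting 1,
# ohms/square transmitting 0). None marks a situation that never occurs.
# The numbers are placeholders, not the values from Table I.
RESISTANCE_TABLE = {
    "example_type": (40_000.0, 20_000.0),
    "example_load": (25_000.0, None),
}


def effective_resistance(table, transistor_type, transmitting_one, length, width):
    """Return ohms: the per-square value times the length/width ratio."""
    ohms_for_one, ohms_for_zero = table[transistor_type]
    per_square = ohms_for_one if transmitting_one else ohms_for_zero
    if per_square is None:
        raise ValueError("no table entry for this type and logic level")
    return per_square * (length / width)


# A device half as long as it is wide contributes half a square.
r_down = effective_resistance(RESISTANCE_TABLE, "example_type", False, 2.0, 4.0)
r_up = effective_resistance(RESISTANCE_TABLE, "example_type", True, 2.0, 4.0)
```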

The  type  of  a  transistor  is  not  determined  solely  by  its  physical  structure 
(enhancement, depletion, p-channel, etc.) but also by the way it is used in the
circuit.  For  example,  enhancement  transistors  driven  through  pass  transistors 
have  different  characteristics  than  enhancement  transistors  driven  directly  by 
depletion loads. When reading in circuits, Crystal distinguishes these two uses
of enhancement transistors and uses a different type for each. Three different
uses of depletion devices are also distinguished by Crystal, since they can
result in different behavior. Users can add new types of their own and label
transistors  as  being  of  the  new  types;  this  provides  a  crude  facility  for  dealing 
with  special  circuit  constructs  such  as  bootstrap  drivers. 

The  table  values  are  generated  by  running  SPICE  simulations  of  simple 
stages.  In  the  SPICE  simulations  a  step  function  is  used  to  drive  the  input, 
and  the  effective  resistance  is  computed  by  dividing  the  delay  time  by  the  load 
capacitance.  It  will  be  shown  below  that  this  results  in  an  underestimation  of 
effective  resistance. 

Capacitance  includes  parasitics  from  the  nodes,  gate-channel  capacitance 
from  transistors  in  the  path,  and  gate-source  capacitance  from  side  transistors 
attached  to  the  path.  When  summing  the  capacitances  along  the  path,  only 




Figure  3.  A  switch-level  approach  to  delay  analysis  automatically  accounts  for 
different  delay  characteristics  at  different  inputs  of  a  NAND  gate.  The  capacitance 
at C1 and C2 will be included when calculating delays from A, but not from B.

Figure 4. A comparison between the RC model and SPICE, using 12 critical paths
(157 stages) extracted by Crystal from two large chips. Each point compares
Crystal’s delay estimate for a stage with the corresponding SPICE time, measured
from a simulation of the critical path. Ideally, all points should fall along the
diagonal.


those between the trigger transistor and the output are used. The trigger
transistor is the last to turn on, so all the capacitance between it and the
signal source is assumed to have discharged. This means that different delays
will be computed from each input of a NAND gate (see Figure 3).
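
The effect of the trigger position on the capacitance sum can be illustrated as follows. The list-based stage representation and the capacitance values are assumptions for this sketch; index 0 is taken to be the transistor nearest the signal source.

```python
def stage_capacitance(node_caps, trigger_index):
    """Sum only the capacitance between the trigger and the output.

    node_caps[i] is the capacitance (in pF) of the node on the output side
    of transistor i, ordered from the signal source toward the output.
    Nodes on the source side of the trigger are assumed already discharged.
    """
    return sum(node_caps[trigger_index:])


# Two series pulldown transistors, as in a two-input NAND gate
# (illustrative values): an internal node of 0.02 pF and an output
# node of 0.10 pF.
caps = [0.02, 0.10]
from_source_side_input = stage_capacitance(caps, 0)  # internal + output
from_output_side_input = stage_capacitance(caps, 1)  # output only
```

The input nearest the signal source sees more capacitance than the input nearest the output, which is why the two inputs get different delays.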

Two large circuits, a microprocessor [1] and an instruction cache [6], were
used to compare the RC model to SPICE. Out of the 40000-50000 transistors
in each chip, Crystal used the RC model to extract 12 critical paths containing
a total of 157 stages. SPICE was used to simulate each critical path, and the
SPICE delays were compared with Crystal’s estimates. Figure 4 makes a
stage-by-stage  comparison  and  Figure  5  compares  total  delays  through  the 
critical  paths.  Although  the  RC  model  often  errs  by  a  factor  of  2  or  more  on 
individual  stage  calculations,  the  errors  of  successive  stages  tend  to  cancel.  On 
average,  it  can  usually  estimate  the  total  delay  through  a  path  to  within  25% 
of  SPICE. 

Much  of  the  RC  model’s  error  is  due  to  consistent  underestimation:  the 
sum  of  all  the  RC  delays  is  20%  less  than  the  sum  of  all  SPICE  delays.  This 
means  that  if  all  the  effective  resistances  were  simply  scaled  by  1.2,  then  the 


Figure  5.  A  comparison  between  the  RC  model  and  SPICE,  using  total  delays 
through  critical  paths.  Each  line  corresponds  to  one  critical  path,  and  each  point  in 
the  line  corresponds  to  a  stage.  The  point  compares  Crystal's  and  SPICE's  estimates 
for  the  total  delay  in  the  critical  path  up  through  that  stage. 
RC delay estimates would generally fall within 10-15% of SPICE's estimates,
at  least  for  these  test  cases.  Unfortunately,  it  isn’t  obvious  whether  or  not 
such  a  scale  factor  depends  on  the  circuit  constructs  or  fabrication  parameters, 
nor  is  it  obvious  how  to  characterize  such  a  dependency  if  one  exists.  For 
these  reasons,  Crystal  has  not  used  the  scale  factor  approach. 

There  are  two  sources  of  error  in  the  RC  model.  One  source  of  error  is 
the  lumping  of  resistances  and  capacitances.  This  tends  to  overestimate  the 
delays  since  it  assumes  that  all  the  capacitance  must  be  discharged  through  all 
the  resistance.  Fortunately,  most  stages  contain  only  a  single  transistor  and 
small  parasitic  resistances,  so  lumping  introduces  only  a  small  error  and  is  not 
the major problem with the RC model. Section 5 will improve on the lumped
model by applying the Penfield-Rubinstein models for distributed capacitance
[9].

The  second  and  most  significant  source  of  error  in  the  RC  model  comes 
from its inability to deal with waveform shape. In practice, the effective
resistance of a transistor depends on the waveform on its gate. If the trigger
transistor turns on instantaneously, then its full driving power is used to drain
the output capacitance and the transistor has a relatively low effective
resistance. If the trigger turns on slowly, then it may do much or all of its work
while only partially turned-on. In this case its effective resistance will be
higher. In the extreme case of a very slow trigger, the output will settle as
quickly as the input changes, so the output voltage is determined by the input
waveform  and  the  dc  transfer  characteristics  of  the  stage.  The  output  will 
reach  the  logic  threshold  at  exactly  the  same  instant  that  the  input  reaches  the 
logic  threshold;  since  delays  are  defined  at  the  logic  threshold  voltage,  the 
stage  will  have  zero  delay  and  the  transistor  will  have  zero  effective  resistance. 

If all waveforms in a circuit have the same shape, then the effective
resistances of transistors can be characterized using that waveform and the RC
model  will  produce  accurate  results.  Unfortunately,  this  is  not  the  case  in 
actual  VLSI  circuits.  Although  most  of  the  waveforms  have  an  exponential 
shape,  they  vary  by  more  than  three  orders  of  magnitude  in  their  slopes.  As  a 
result,  the  effective  resistance  of  the  transistors  varies  by  more  than  a  factor  of 
ten  and  the  RC  model  produces  only  a  rough  estimate  for  delays. 

4.  The  Slope  Model 

The  slope  model  incorporates  information  about  waveform  shape  in  order 
to  make  more  accurate  delay  estimates.  It  assumes  that  all  waveforms  are 
exponential  in  overall  shape  but  vary  in  their  slopes.  Each  waveform  is 
represented by its inversion time and its rise-time, in ns/volt at the logic
threshold. A rise-time of zero corresponds to a step function, and a large rise-time
corresponds to a slowly rising or falling signal.

Unfortunately, the effective resistance of a transistor depends not just on
its  gate’s  rise-time,  but  also  on  the  load  being  driven  by  the  stage  and  on  the 
sizes  of  the  transistors  in  the  stage.  If  a  stage  is  driving  a  large  load,  or  has 
very  small  transistors,  then  only  very  slow  input  rise-times  will  affect  the 
stage’s  delay.  If  a  stage  is  driving  a  small  load  or  has  very  large  transistors,  its 
delay  will  be  more  sensitive  to  the  rise-time  of  its  input.  The  important  issue 
is  whether  or  not  the  trigger  transistor  is  fully  turned-on  when  it  does  most  of 
its work, and this depends on the input rise-time, the output load, and
transistor size.


The  key  to  implementing  the  slope  model  was  the  discovery  that  all  of 
these  factors  can  be  combined  into  a  single  ratio,  which  alone  determines  the 
transistor’s  effective  resistance.  In  the  slope  model,  the  output  load  and 
transistor  size  are  characterized  by  the  stage’s  intrinsic  rise-time,  which  is  the 
rise-time  that  would  occur  at  the  output  if  the  input  were  driven  by  a  step 
function. The input rise-time of a stage is then divided by the intrinsic
rise-time of the stage’s output to produce the rise-time ratio for the stage. An
extensive series of SPICE simulations verified that, to a very close
approximation, the effective resistance of a transistor is a function only of the rise-time
ratio. This approximation holds across the entire range of structures that
occur in our digital circuits. Pilling and Skalnik were apparently the first to
suggest the ratio approach [8].

Figure 6. A plot of the tables that characterize the resistance of different types of
transistors as a function of the rise-time ratio of the stage. A rise-time ratio of zero
means the stage’s input rises very quickly in comparison to the output. Different
tables are used when transmitting zero (e.g. the "enhancement down" curve) and
transmitting one (e.g. the "enhancement up" curve).

In  the  slope  model,  each  transistor  type  is  characterized  by  two  resistance 
tables,  one  used  when  the  transistor  is  transmitting  a  logic  0  and  one  used 
when  it  is  transmitting  a  logic  1.  Each  table  gives  effective  resistance  values  as 
a  function  of  rise-time  ratio.  During  timing  analysis,  the  effective  resistance  of 
each  trigger  transistor  is  computed  by  interpolating  in  the  appropriate  table. 
All  transistors  except  the  trigger  are  assumed  to  be  fully  turned-on,  so  there  is 
no  need  to  interpolate  for  them  (the  effective  resistance  for  the  RC  model, 
which  is  measured  with  a  zero  rise-time  ratio,  can  be  used). 
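
A minimal sketch of the table lookup follows. It assumes linear interpolation between table points and clamping outside the tabulated range; the text says only that Crystal interpolates, so both choices, along with the function names and table values, are assumptions made for illustration.

```python
from bisect import bisect_right


def rise_time_ratio(input_rise_time, intrinsic_rise_time):
    """Input rise-time divided by the stage's intrinsic rise-time."""
    if intrinsic_rise_time <= 0:
        raise ValueError("intrinsic rise-time must be positive")
    return input_rise_time / intrinsic_rise_time


def interpolate_resistance(table, ratio):
    """Look up ohms/square in a table of (ratio, ohms/square) pairs.

    The table must be sorted by ratio. Values between entries are linearly
    interpolated; values outside the table are clamped to the end entries.
    """
    ratios = [r for r, _ in table]
    if ratio <= ratios[0]:
        return table[0][1]
    if ratio >= ratios[-1]:
        return table[-1][1]
    upper = bisect_right(ratios, ratio)  # first entry with a larger ratio
    (r0, v0), (r1, v1) = table[upper - 1], table[upper]
    return v0 + (v1 - v0) * (ratio - r0) / (r1 - r0)


# Illustrative table; the entry at ratio 0 plays the role of the RC-model value.
example_table = [(0.0, 20_000.0), (1.0, 30_000.0), (4.0, 60_000.0)]
ratio = rise_time_ratio(2.0, 4.0)
trigger_ohms_per_square = interpolate_resistance(example_table, ratio)
```

Only the trigger transistor needs this lookup; every other transistor in the stage uses the entry at ratio zero.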

The  parameter  tables  for  the  slope  model  were  generated  in  much  the 
same  way  as  for  the  RC  model:  SPICE  simulations  were  run  on  simple  stages 
and parameters were extracted from the output. Each table contains six to
ten values. Figure 6 plots the contents of a few of the tables for our nMOS
process. As with the RC models, non-standard circuit structures can be
handled by giving their transistors different types and generating special
parameter tables for them.

The  ratio  approach  is  important  because  it  allows  transistors  to  be 
parameterized  with  one-dimensional  tables,  instead  of  three-dimensional  tables 
based  on  input  rise-time,  load,  and  transistor  size.  The  three-dimensional 
approach  would  require  large  amounts  of  CPU  time  to  generate  the  tables  and 
would  also  make  the  delay  analysis  slower  by  requiring  three-dimensional 
interpolation. 

The  accuracy  of  the  slope  model  depends  on  the  accuracy  with  which 
rise-times  can  be  calculated.  It  is  the  responsibility  of  the  delay  modeller  to 
estimate  the  rise-time  at  the  output  of  each  stage;  this  value  is  used  as  the 
input  rise-time  of  later  stages.  The  current  implementation  of  the  slope  model 
approximates  actual  output  rise-times  by  using  intrinsic  rise-times  everywhere. 
The  intrinsic  rise-time  is  computed  under  the  assumption  that  the  input  is  a 
step  function  and  the  output  has  an  exponential  shape.  For  any  given  voltage, 
the  slope  of  an  exponential  waveform  at  that  voltage  is  proportional  to  the 
delay  time  to  reach  that  voltage.  This  means  that  the  intrinsic  rise-time  of  a 
stage  is  proportional  to  its  intrinsic  delay,  which  is  the  delay  computed  by  the 
RC model.
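
The proportionality can be checked with a short derivation. The sketch below assumes a falling output that decays exponentially from a supply voltage $V_{dd}$ with time constant $\tau$; the symbols $V_{dd}$, $\tau$, $V_{th}$ and $t_{th}$ are introduced here for illustration and do not appear in the text above.

```latex
% Assumed waveform: exponential decay from V_dd with time constant tau.
V(t) = V_{dd}\, e^{-t/\tau}

% Delay to reach the logic threshold V_th:
t_{th} = \tau \ln\!\left(\frac{V_{dd}}{V_{th}}\right)

% Magnitude of the slope at the threshold, in volts per unit time:
\left|\frac{dV}{dt}\right|_{V = V_{th}} = \frac{V_{th}}{\tau}

% Rise-time in time per volt is the reciprocal of that slope:
\text{rise-time} = \frac{\tau}{V_{th}}
                 = \frac{t_{th}}{V_{th}\,\ln\!\left(V_{dd}/V_{th}\right)}
```

For a fixed threshold voltage the denominator is a constant, so under this assumption the rise-time is a fixed multiple of the delay.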


The intrinsic rise-time is a lower bound on the actual rise-time of a stage’s
output: if the input has a large rise-time, then the actual rise-time at the
output will be larger than the intrinsic rise-time. However, initial attempts at
producing more accurate rise-time estimates did not produce noticeable
improvements in the overall accuracy of the model. (Note to referees: this
rise-time work is still in progress, and I expect to evaluate it more thoroughly
in time for the final conference version of the paper.)

Figure 7. A stage-by-stage comparison between the slope model and SPICE, for
the same critical paths as in Figure 4.

Figure  8.  A  total-delay  comparison  between  the  slope  model  and  SPICE,  using  the 
same  critical  paths  as  in  Figure  5. 

Figures  7  and  8  compare  the  slope  model  to  SPICE  with  the  same  data  as 
in  Figures  4  and  5.  These  figures  show  that  the  slope  model  is  substantially 
more  accurate  than  the  RC  model.  The  average  error  for  individual  stages 
was reduced from 45% to 23%, and the average overall error over several
consecutive stages dropped from 24% to only 8%. Only rarely does the estimate
for  a  critical  path  differ  from  SPICE  by  more  than  20%.  Table  II  summarizes 
this  data  and  the  corresponding  data  for  the  additional  enhancement  described 
in  Section  5. 

Model             Overall    Av. Stage    Av. Path    Std. Deviation
                  Error      Error        Error       in Path Error

RC Model          -22%       45%          24%         10%
Slope Model         4%       23%           8%          9%
PR-Slope Model      2%       20%           6%          7%

Table II. A comparison of the three delay models. “Overall Error” compares the
sum of all the delays computed by Crystal with the sum of all delays computed by
SPICE, to point out consistent underestimates or overestimates. “Av. Stage Error”
is the average of the absolute value of errors for individual stages (the points in
Figures 4 and 7). “Av. Path Error” is the average absolute error in estimating total
delays in the critical paths up to each stage (the points in Figures 5 and 8). The
rightmost column gives the standard deviation in the errors from the “Av. Path Error”
column.
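
The three error measures in Table II might be computed along the following lines for a single critical path. The function names are hypothetical, and expressing each error relative to the SPICE value is an assumption; the caption does not state the normalization.

```python
from itertools import accumulate


def overall_error(crystal, spice):
    """Signed error of the summed delays, relative to SPICE."""
    return (sum(crystal) - sum(spice)) / sum(spice)


def average_stage_error(crystal, spice):
    """Mean absolute relative error of the individual stage delays."""
    errors = [abs(c - s) / s for c, s in zip(crystal, spice)]
    return sum(errors) / len(errors)


def average_path_error(crystal, spice):
    """Mean absolute relative error of the running total up to each stage."""
    pairs = zip(accumulate(crystal), accumulate(spice))
    errors = [abs(c - s) / s for c, s in pairs]
    return sum(errors) / len(errors)


# Made-up stage delays (ns) for one two-stage path.
crystal_delays = [1.0, 3.0]
spice_delays = [2.0, 2.0]
```

In this made-up case each stage is off by 50%, yet the overall error is zero and the average path error is 25%, which illustrates how errors in successive stages can cancel.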

5. Complex Stages and the PR-Slope Model

Although  most  stages  in  MOS  circuits  contain  only  a  single  transistor,  the 
delay  modeller  must  still  be  able  to  make  reasonable  delay  estimates  for  more 
complex  stages.  In  the  RC  and  slope  models,  complex  stages  are  handled  by 
lumping  resistances  and  capacitances.  The  lumped  approach  is  pessimistic  for 
complex paths with distributed capacitance, since it assumes that all the
capacitance must be discharged through all the resistance (see Figure 9 for an
example). 




Figure 9. The basic slope model will compute delays under the assumption that the
capacitance  at  C  must  discharge  through  both  transistors,  whereas  in  fact  it  only 
discharges  through  one. 


Switch-Level  Delay  Models  for  Digital  MOS  VLSI 


November  20,  1983 


In order to handle such situations more accurately, a new model called
PR-slope was implemented. It is similar to the slope model except that it uses
the results of Penfield and Rubinstein [7,9] to avoid lumping the capacitance.
[9] provides upper and lower bounds for the delay; Crystal uses the average of
the two. Since each stage is a linear chain of transistors with no side trees, the
average of the upper and lower bounds is

$$\text{delay} = \sum_i R_i C_i$$

R_i is the resistance from the signal source to point i, and C_i is the capacitance
at point i. This means that instead of weighting all the capacitance by all the
resistance, each separate capacitor is weighted only by the resistance between
it and the signal source.
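
The difference between the two weightings can be seen in a small sketch. It covers only the resistance-capacitance sum; the rise-time ratio lookup that PR-slope shares with the slope model is omitted. The function names and component values are illustrative assumptions.

```python
from itertools import accumulate


def lumped_delay(resistances, capacitances):
    """RC and slope models: all capacitance weighted by all resistance."""
    return sum(resistances) * sum(capacitances)


def distributed_delay(resistances, capacitances):
    """Sum of R_i * C_i, where R_i is the resistance from the source to point i.

    resistances[i] is the resistance of the element just before point i,
    ordered from the signal source toward the output, and capacitances[i]
    is the capacitance at point i.
    """
    source_to_point = accumulate(resistances)  # running sum gives each R_i
    return sum(r * c for r, c in zip(source_to_point, capacitances))


# Two 10 kilohm transistors in series; 0.1 pF between them, 0.05 pF at the output.
resistances = [10e3, 10e3]
capacitances = [0.1e-12, 0.05e-12]
lumped = lumped_delay(resistances, capacitances)
distributed = distributed_delay(resistances, capacitances)
```

Here the lumped estimate is about 3 ns and the distributed estimate about 2 ns, because the inner capacitor is charged only against the first transistor's resistance.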

The addition of the Penfield-Rubinstein models made an additional
improvement in accuracy, which is summarized in Table II. The average error
for a single-stage estimate dropped from 23% to 20%, and the average error
over paths containing several stages dropped from 8% to 6%. The small
improvement supports the conclusion that complex stages are rare in large
circuits and suggests that the main remaining cause of error is inaccuracy in
estimating rise-times. However, the PR-slope model is almost as efficient as
the basic slope model, and can avoid gross over-estimates that will occasionally
happen in the basic slope model.

6. Limits of the Models

The  models  presented  here  are  nearly  as  accurate  as  SPICE  over  a  wide 
range  of  digital  circuit  constructs  and  loading  characteristics.  There  are  three 
limitations to the models, however. The first, inaccuracy in estimating
rise-times, was discussed in Section 4. The other two limitations are discussed in
the  paragraphs  below.  They  have  to  do  with  complex  stages  and  the  number 
of  transistor  types. 

Although  the  PR-slope  model  makes  a  first-order  attempt  to  deal  with 
complex  stages,  there  are  still  several  situations  that  can  cause  it  to  produce 
inaccurate  results.  One  source  of  error  in  complex  stages  is  the  assumption 
that  only  a  single  transistor  is  turning  on  or  off  at  once.  If  two  transistors  turn 
on  simultaneously  in  a  NOR  gate,  the  gate’s  delay  will  be  less  than  predicted; 
if  two  transistors  turn  on  simultaneously  (and  slowly)  in  a  NAND  gate,  its 
delay  will  be  greater  than  predicted.  Another  source  of  error  in  complex 
stages  is  the  additive  treatment  of  different  transistors  and  resistors:  total 
resistances and rise-times for stages are computed by summing the
contributions of each device along the stage. Although this is an accurate
approximation for resistors, it is less accurate for non-linear devices such as pass
transistors.

The  third  potential  limitation  of  the  models  has  to  do  with  the  number  of 
transistor  types.  When  transistors  are  used  in  different  ways,  such  as 
bootstrap drivers, super buffers, or even gates with different pullup/pulldown
ratios, they must be modelled with separate transistor types. In general, each
construct with a separate dc transfer characteristic must be modelled
separately. SPICE simulations must be run to extract the parameters for each
type. In principle this allows an unlimited number of different circuit
constructs: every transistor in the circuit could ostensibly be given a different
type. However, the number of different transistor types is limited in practice
by the human and computer time that must be expended to extract all their
parameters. If a large number of transistor types is required to model a circuit
accurately, too much parameter extraction time will be required for the
approach to be reasonable. Fortunately, our current design style at U.C.
Berkeley is uniform enough that a half dozen different transistor types is
enough.


7.  Related  Work 


The importance of using waveform information in computing delays has
been recognized for some time [4,8,10,11]. In [11], analytic models were
developed to model the effects of waveforms and those models were validated
against circuit simulation, but only over a relatively small range of rise-time
ratios. As a consequence, Tokuda et al. concluded that load transistors were
insensitive to waveform. Tamura et al. have also developed an analytical
model for waveform effects [10], but their model appears to apply only to
simple gates without pass transistors. The ratio approach to modelling waveform
effects  was  suggested  in  1972  by  Pilling  and  Skalnik  [8],  although  they  also  did 
not  deal  with  pass  transistors,  and  the  idea  doesn’t  appear  to  have  been  used 
since  then. 

8.  Conclusions 

This  paper  shows  that  simple  models  can  estimate  delays  in  MOS  digital 
circuits  to  within  10%  of  SPICE.  Since  other  factors,  such  as  variations  in 
processing,  are  likely  to  cause  errors  at  least  as  great  as  this,  the  accuracy  of 
the  simple  models  appears  to  be  acceptable  for  a  wide  variety  of  applications. 
The  models  are  fast:  delays  can  be  estimated  for  typical  stages  in  less  than 
one  millisecond  using  the  PR-slope  model. 

Two  factors  contribute  to  the  success  of  Crystal’s  models.  The  first  factor 
is  the  switch-level  approach  based  on  stages:  it  enables  the  system  to  handle  a 
variety  of  different  circuit  constructs  in  a  uniform  fashion.  The  second  factor 
is the use of rise-time ratios, which results in small parameter tables and fast
interpolation.

Abstract

Magic is a “smart” layout system for integrated circuits. It incorporates
expertise about design rules and connectivity directly into the layout system in
order to implement powerful new operations, including: a continuous
design-rule checker that operates in background to maintain an up-to-date picture of
violations; an operation called plowing that permits interactive stretching and
compaction; and routing tools that can work under and around existing
connections in the channels. Magic uses a new data structure called corner
stitching to achieve an efficient implementation of these operations.

Keywords and Phrases: interactive layout editor, corner stitching,
design-rule checking, routing, stretching, compaction.

1. Introduction

Magic is a new interactive layout editing system for large-scale MOS
custom integrated circuits. The system contains knowledge about geometrical
design rules, transistors, connectivity, and routing. Magic uses its knowledge
to provide powerful interactive operations that simplify the task of creating
layouts. Moreover, once a layout has been entered, Magic makes it easy to
modify it; this permits designers to fix bugs easily, experiment with
alternative designs, and make performance enhancements.

Magic  provides  several  new  operations  for  its  users.  Design  rules  are 
checked continuously and incrementally during editing sessions to keep
up-to-date information about violations. When the layout is finished, then so is the
design-rule check. A new operation called plowing allows layouts to be
compacted and stretched while observing all the design rules and maintaining
circuit structure. Routing tools are provided that can work under and around
existing  wires  in  the  channels  (such  as  power  and  ground  routing)  while  still 
providing  the  traditional  efficiency  of  a  channel  router. 

Two  aspects  of  Magic’s  implementation  make  the  new  operations  possible. 
First, the system is based on a data structure called corner stitching which is
both simple and efficient for a variety of geometrical operations [6]. Without
corner stitching, most of Magic’s new operations would be too slow for
interactive use. Second, designs in Magic are specified using abstract layers, rather
than actual mask layers. The abstract layers represent circuit structures such
as  contacts  and  transistors  in  a  form  that  appears  somewhat  like  sticks  [14] 
except  that  objects  are  seen  in  their  actual  sizes  and  positions.  The  abstract 
layers  incur  no  density  penalty,  but  they  simplify  the  designer’s  view  of  the 
system  and  provide  more  explicit  information  about  the  circuit  structure. 

This  paper  gives  an  overview  of  the  Magic  system.  Section  2  describes 
the  specific  problems  Magic  attempts  to  solve,  and  the  overall  approach  of  the 
system.  Sections  3  and  4  describe  the  data  structure  and  abstract  layers  used 
in  the  Magic  implementation.  Sections  5-11  discuss  Magic’s  new  operations, 
and Section 12 presents the implementation status of the system. Three
additional papers in this technical report discuss design-rule checking, plowing, and
routing  in  detail  [2,11,12]. 

2.  Background  and  Goals 

Our  previous  layout  editing  systems,  Caesar  [5,7]  and  KIC2  [3],  have  been 
used since 1980 for a variety of large and small designs in several MOS
technologies. They are similar to systems currently in use in industry. Although our
systems have proven quite useful, we uncovered a few areas where they (and
most other existing layout systems) are inadequate. The most severe
inadequacy is in the area of routing, where most systems provide little support. We
estimate that between 25% and 50% of all layout time for our circuits is used
for hand-routing the global interconnections, even though the circuits are
highly  regular  to  begin  with.  The  task  of  routing  is  tedious  and  error-prone. 

A  more  general  problem  is  one  of  flexibility.  Once  a  design  has  been 
entered  into  the  layout  system,  it  is  hard  to  change.  This  makes  it  difficult  to 
fix  bugs  found  late  in  the  layout  process,  and  almost  impossible  to  experiment 
with  alternative  designs.  If  designers  cannot  experiment  with  and  evaluate 
alternatives, it is hard for them to develop intuition about what is good and
bad. Routing is the most extreme example of the flexibility problem. It takes
so long to route a circuit that it is out of the question to re-route a chip to try
a new floor-plan. Even small cells are difficult to change: modest changes to
the  topology  of  a  cell  often  require  the  entire  cell  to  be  re-entered.  In  many 
industrial  settings,  layouts  are  so  difficult  to  enter  and  modify  that  designs  are 
completely  frozen  before  layout  begins. 

Our overall goal for Magic is to increase the power and flexibility of the
layout editor so that designs can be entered quickly and modified easily.
When the system is complete, we hope it will provide order-of-magnitude
speedups  for  three  different  parts  of  the  design  process: 

1)  Once  a  large  circuit  has  been  routed,  it  should  be  possible  to  remove  the 
routing  and  re-route  in  a  few  hours.  Even  the  initial  routing  should  not 
require  more  than  a  few  days  for  a  large  custom  circuit.  With  our 
current  systems,  routing  requires  a  few  weeks  to  a  few  months. 

2) The turnaround time for small bug fixes should be less than 15 minutes.
For  example,  if  a  bug  is  found  while  simulating  the  circuit  extracted  from 
a  layout,  it  should  be  possible  to  fix  the  layout,  verify  that  the  new  layout 
meets  the  design  rules,  and  re-extract  the  circuit,  all  in  15  minutes.  This 
process currently requires several hours of CPU time and at least a
half-day of elapsed time.

3)  It  should  not  take  more  than  30  seconds  to  1  minute  to  re-arrange  a  cell 
to  try  out  a  different  topology.  With  our  current  systems  this  requires 
anywhere  from  tens  of  minutes  to  several  hours. 

Magic  meets  these  goals  by  combining  circuit  expertise  with  an  interactive 
editor.  It  understands  layout  rules;  it  knows  what  transistors  and  contacts  are 
(and  that  they  must  be  treated  differently  than  wires);  and  it  knows  how  to 
route  wires  efficiently.  Magic  uses  the  circuit  knowledge  to  provide  interactive 
operations  that  re-arrange  a  circuit  as  a  circuit,  rather  than  as  a  collection  of 
geometrical  objects.  It  also  performs  analysis  operations,  like  design-rule 
checking, incrementally, as the circuit is created and modified. This means
that only a small amount of work must be done each time the circuit is
modified.

In Magic, as in most other layout editors, a layout consists of cells. Each
cell  contains  two  sorts  of  things:  geometrical  shapes  and  subcells.  Magic 
represents the contents of cells using a technique called corner stitching.
Corner  stitching  is  a  geometrical  data  structure  for  representing  Manhattan 
shapes.  It  provides  the  underlying  mechanisms  that  make  possible  most  of 
Magic’s  advanced  features.  Corner  stitching  is  simple,  provides  a  variety  of 
efficient  searching  operations,  and  allows  the  database  to  be  modified  quickly. 
What  follows  is  a  brief  introduction  to  corner  stitching.  See  [6]  for  a  more 
complete  description. 

Figure 1. Every point in a corner-stitched plane is contained in exactly one tile. In
this case there are three solid tiles, and the rest of the plane is covered by space tiles
(dotted lines). The space tiles on the sides extend to infinity. In general, a plane
may contain many different types of tiles.

Figure 2. Areas of the same type of material are represented with horizontal strips
that are as wide as possible, then as tall as possible. In each of the figures the tile
structure on the left is illegal and is converted into the tile structure on the right. In
(a) it is illegal for two tiles of the same type to share a vertical edge. In (b) the two
tiles must be merged together since they have exactly the same horizontal span.

The basic elements in corner stitching are planes and tiles. Each cell
contains a number of corner-stitched planes to represent the cell’s geometries
and subcells; each plane consists of a number of rectangular tiles of different
types. There are three important properties of a corner-stitched plane, illustrated
in Figures 1, 2, and 3:

Coverage:  Each  point  in  the  x-y  plane  is  contained  in  exactly  one  tile  (Figure 
1).  Empty  space  is  represented  as  well  as  the  area  covered  with 
material. 

Strips: Material of the same type is represented with horizontal strips (Figure
2). The strip structure provides a canonical form for the database
and prevents it from fracturing into a large number of small
tiles.

Stitches:  Tiles  are  linked  together  at  their  corners.  Each  tile  contains  four 
of  these  links,  called  stitches  (Figure  3). 
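
The way stitches support point location can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names (`Tile`, `find_tile`) are hypothetical, each tile stores all four coordinates for readability (the text does not say what Magic stores), the plane is finite rather than infinite, and the walk is a simplified variant of a stitch-following search, not Magic's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Tile:
    """A rectangular tile covering x0 <= x < x1 and y0 <= y < y1."""
    kind: str
    x0: int
    y0: int
    x1: int
    y1: int
    # The four corner stitches (None at the border of this finite example).
    left: Optional["Tile"] = None    # left neighbor at the lower-left corner
    below: Optional["Tile"] = None   # lower neighbor at the lower-left corner
    right: Optional["Tile"] = None   # right neighbor at the upper-right corner
    above: Optional["Tile"] = None   # upper neighbor at the upper-right corner

    def contains(self, x, y):
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1


def find_tile(start, x, y):
    """Walk the stitches from any starting tile to the tile containing (x, y)."""
    tile = start
    while not tile.contains(x, y):
        if y < tile.y0:
            tile = tile.below
        elif y >= tile.y1:
            tile = tile.above
        elif x < tile.x0:
            tile = tile.left
        else:
            tile = tile.right
        if tile is None:
            raise ValueError("point lies outside the plane")
    return tile


# One solid tile surrounded by space tiles in maximal horizontal strips.
bottom = Tile("space", 0, 0, 10, 3)
west = Tile("space", 0, 3, 3, 7)
solid = Tile("poly", 3, 3, 7, 7)
east = Tile("space", 7, 3, 10, 7)
top = Tile("space", 0, 7, 10, 10)
bottom.above = east
west.below, west.right, west.above = bottom, solid, top
solid.left, solid.below, solid.right, solid.above = west, bottom, east, top
east.left, east.below, east.above = solid, bottom, top
top.below = west

found = find_tile(top, 5, 5)   # starts in the top space tile, ends in the solid tile
```

Because every point is covered by exactly one tile, the walk can start anywhere; starting from a tile near the previous search keeps the number of stitch traversals small.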


Figure 3. Each tile is linked to its neighbors with four pointers, called corner
stitches. The corner stitches provide a form of two-dimensional sorting. They permit
a variety of geometrical operations to be performed efficiently, such as searching
an area or finding all the neighboring tiles on one side of a given tile.


The  stitches  permit  a  variety  of  search  operations  to  be  performed  efficiently, 
including:  finding  the  tile  containing  a  given  point;  finding  all  the  tiles  in  an 
area;  finding  all  the  tiles  that  are  neighbors  of  a  given  tile;  and  traversing  a 
connected  region  of  tiles.  The  coverage  property  makes  it  easy  to  update  the 
database  in  response  to  edits,  and  the  strip  property  keeps  the  database 
representation  small.  To  the  best  of  our  knowledge,  corner  stitching  is  unique 
in  its  ability  to  provide  these  efficient  two-dimensional  searches  and  yet  permit 
fast  updates  of  the  kind  needed  in  an  interactive  tool.  The  only  disadvantage 
of  corner  stitching  in  comparison  to  less  powerful  data  structures  is  that  it 
requires  more  storage  space  (about  three  times  as  much  space  as  structures 
based  on  linked  lists  of  rectangles).  Even  so,  the  storage  requirements  do  not 
appear to be a problem for chips likely to be designed in the next several
years.

4. Abstract Layers

There  are  several  ways  in  which  corner-stitched  planes  might  be  used  to 
represent  the  mask  geometries  in  a  cell.  One  alternative  is  to  use  a  separate 
plane for each mask layer; each plane contains space tiles and tiles of one particular
mask type. The disadvantage of this approach is that many operations,
such as design-rule checking and circuit extraction, require information
about  layer  interactions  (such  as  polysilicon  crossing  diffusion  to  form  a 
transistor,  or  implants  changing  the  type  of  a  transistor).  With  a  separate 
plane  per  mask  layer,  these  operations  will  spend  a  substantial  amount  of  time 
cross-registering  the  information  on  different  planes. 

Another  alternative  is  to  place  all  the  mask  layers  into  a  single  corner- 
stitched  plane.  Since  there  can  be  only  one  tile  at  a  given  point  in  a  given 
plane,  different  tile  types  must  be  used  for  each  possible  overlap  of  mask 
layers.  This  eliminates  the  registration  problem,  but  results  in  a  large  number 
of  small  tiles  where  several  mask  layers  overlap.  Even  though  many  of  the 
layer  overlaps  are  not  significant  (such  as  metal  and  implant),  separate  tile 
types  have  to  be  used  to  represent  them.  As  a  result,  the  database  fragments 
into  a  large  number  of  tiles,  and  the  overheads  for  all  operations  increase. 

The  solution  we  chose  for  Magic  lies  between  these  two  extremes.  We 
decided  to  use  a  small  number  of  planes,  where  each  plane  contains  a  set  of 
layers that have design-rule interactions. If layers do not have direct design-rule
interactions (such as poly and metal), they may be placed in different
planes. Some layers, such as contacts, may appear in two or more planes. In
our single-metal nMOS process there are two planes: one for polysilicon,
diffusion, transistors, and buried contacts; and one for metal (see Table I).

Plane        Tile Types
---------    -----------------------
Poly-Diff    Polysilicon
             Diffusion
             Enhancement Transistor
             Depletion Transistor
             Buried Contact
             Poly-Metal Contact
             Diffusion-Metal Contact
             Space

Metal        Metal
             Poly-Metal Contact
             Diffusion-Metal Contact
             Overglass Via to Metal
             Space

Table I. The corner-stitched planes and tile types used to represent the mask information
for an nMOS process with buried contacts and single-level metal. Since polysilicon
and diffusion have design-rule interactions, they are placed in the same
plane. Metal interacts with polysilicon and diffusion only at contacts, so it is placed
in a separate plane. Contacts between metal and diffusion or polysilicon are duplicated
in both planes.

We also decided not to represent every mask layer explicitly. Instead of
dealing with actual mask layers, Magic is based around abstract layers. The
abstract layers do not include implants, wells, buried contact windows, or contact
vias. Instead, the abstract layers include separate tile types for each possible
kind of transistor and contact. Magic generates the missing mask layers
when it creates CIF files for fabrication. Table I gives the planes and abstract
layers used in Magic, and Figure 4 illustrates how the abstract layers are used
in a sample cell. Abstract layers change the way a circuit looks on the screen
but they do not incur any density penalty.

Figure 4. In Magic, transistors and contacts are drawn in an abstract form: (a) a
three-transistor shift-register cell, showing actual mask layers; (b) the same cell as it
is seen in Magic; (c) the information in Magic’s poly-diff plane; (d) the information
in Magic’s metal plane. Contacts are duplicated in each plane.

The Magic design style is similar to sticks and symbolic systems such as
Mulga [13] and VIVID [10], except that the geometries are fully fleshed.
Designers draw the primary interconnection layers and simplified forms of contacts
and transistors. Magic fills in the structural details. As in sticks, there
are  simple  operations  for  stretching  and  compacting  cells.  The  advantage  of 
Magic’s  abstract-layer  approach  is  that  designers  can  see  the  exact  size  and 
shape  of  a  cell  while  it  is  being  edited,  and  they  only  work  with  a  single 
representation  of  the  circuit.  When  using  sticks,  designers  go  back  and  forth 
between  the  sticks  and  mask  representation;  the  final  size  of  the  cell  is  hard  to 
determine  until  it  has  been  compacted  and  fleshed  out.  The  following  sections 
will  show  how  the  abstract  layers  simplify  design-rule  checking,  plowing,  and 
circuit  extraction. 

In  addition  to  the  planes  used  to  hold  mask  geometry,  each  cell  contains 
another  plane  to  hold  information  about  its  subcells.  Subcells  are  allowed  to 
overlap  in  Magic;  each  distinct  subcell  area  or  overlap  between  subcells  is 
represented  with  a  different  tile  in  the  subcell  plane.  Each  tile  contains 
pointers  to  all  of  the  subcells  that  cover  the  tile’s  area.  By  using  corner- 
stitching  in  this  way,  it  is  easy  to  find  subcell  interactions  and  to  determine 
which  (if  any)  subcells  cover  a  particular  area. 

5. Basic Commands

The  basic  set  of  commands  in  Magic  is  similar  to  the  commands  in  Caesar 
[5,7].  Mask  geometry  is  edited  in  a  style  like  painting:  a  rectangle  is  placed 
over  an  area  of  the  layout,  and  mask  layers  may  be  painted  or  erased  over  the 
area  of  the  rectangle.  Additional  operations  are  provided  to  make  a  copy  of 
all  the  “paint”  in  a  rectangular  area  and  copy  it  back  at  a  different  place  in 
the  layout.  The  corner-stitched  representation  is  invisible  to  users. 

Magic  also  provides  commands  for  manipulating  subcells.  Subcells  may 
be placed in a parent, moved, mirrored in x or y, rotated (by multiples of 90
degrees  only),  arrayed,  and  deleted.  Subcells  are  handled  by  reference,  not  by 
copying:  if  a  subcell  is  modified,  the  modifications  will  be  reflected  everywhere 
that  the  subcell  is  used. 

6. Incremental Design-Rule Checking

Design-rule  checking  is  an  integral  part  of  the  Magic  system.  Our  main 
goal  was  to  make  the  checker  very  fast,  particularly  for  small  changes:  the 
cost  of  reverifying  a  layout  should  be  proportional  to  the  amount  of  the  layout 
that  has  been  changed,  not  to  the  total  size  of  the  layout.  To  achieve  this, 
Magic’s design-rule checker runs continuously in the background during editing
sessions. When the layout is changed, Magic records the areas that must
be reverified. The design-rule checker then rechecks these areas during the
time when the user is thinking. For small changes, error information appears
on the screen instantly (and also disappears instantly when the problem has
been fixed). For large changes (such as moving one large subcell on top of
another), it may take seconds or minutes for the design-rule checker to complete
its job. In the meantime, the designer can continue editing. If
reverification hasn’t been completed when an editing session ends, the areas
still to be reverified are stored with the cell so that reverification can be completed
the next time the cell is edited. Error information is also stored with
cells  until  the  errors  are  fixed.  With  this  mechanism,  there  is  never  a  need  to 
check  a  layout  “from  scratch.” 
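
The bookkeeping described above (record changed areas, recheck them while the user is idle, keep errors until they are fixed) can be sketched as follows. This is a minimal illustration, not Magic's implementation: the class name `IncrementalChecker`, the `budget` parameter, and the callback `check_area` are all assumptions introduced for the example.

```python
from collections import deque


class IncrementalChecker:
    """Keeps a queue of areas awaiting reverification and the errors found so far."""

    def __init__(self, check_area):
        self.check_area = check_area  # callable(rect) -> list of error messages
        self.pending = deque()        # areas changed but not yet rechecked
        self.errors = {}              # rect -> errors, kept until the area is clean

    def record_edit(self, rect):
        """Called by the editor whenever the layout changes inside rect."""
        if rect not in self.pending:
            self.pending.append(rect)

    def run_idle(self, budget):
        """Recheck up to `budget` areas; returns the number still pending."""
        while self.pending and budget > 0:
            rect = self.pending.popleft()
            found = self.check_area(rect)
            if found:
                self.errors[rect] = found
            else:
                self.errors.pop(rect, None)  # error disappears once fixed
            budget -= 1
        return len(self.pending)


# A stand-in rule check: areas listed in `bad` report a violation.
bad = {(0, 0, 4, 4)}
checker = IncrementalChecker(lambda r: ["spacing violation"] if r in bad else [])
checker.record_edit((0, 0, 4, 4))
checker.record_edit((10, 10, 12, 12))
remaining = checker.run_idle(budget=1)  # one area checked, one still pending
```

Both `pending` and `errors` are plain data, so they could be written out with the cell at the end of a session and reloaded later, which is the property that removes the need to check a layout from scratch.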

Magic’s  basic  rule-checker  works  from  the  edges  in  a  design.  Based  on 
the  type  of  material  on  either  side  of  an  edge,  it  verifies  constraints  that 
require  certain  layers  to  be  present  or  absent  in  areas  around  the  edge.  There 
are  several  reasons  why  corner  stitching  and  the  abstract  layers  allow  edge 
rules  to  be  checked  quickly.  Each  corner-stitched  plane  can  be  checked 
independently.  All  the  “interesting”  edges  are  already  present  in  the  tile 
structure,  so  there  is  no  need  to  register  different  mask  layers.  The  abstract 
layers  make  it  unnecessary  to  check  formation  rules  associated  with  implants 
and  vias.  Lastly,  corner  stitching  provides  efficient  algorithms  for  locating  all 
the  edges  in  an  area  and  for  searching  the  constraint  areas. 
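
An edge rule of this kind can be illustrated on a single horizontal strip of tiles. The sketch below is a one-dimensional simplification with assumed rule values (a width and a spacing of 2 units for a hypothetical "poly" type); the names `RULES` and `check_strip` are illustrative, and Magic's real rules come from its technology file.

```python
# Rules are keyed by (material left of the edge, material right of the edge).
# Each value is (distance, types allowed within that distance to the right).
RULES = {
    ("space", "poly"): (2, {"poly"}),    # assumed minimum poly width of 2
    ("poly", "space"): (2, {"space"}),   # assumed minimum poly spacing of 2
}


def check_strip(tiles):
    """tiles: abutting (x0, x1, kind) triples sorted left to right.

    Returns a list of (edge position, message) pairs, one per violated edge.
    """
    errors = []
    for i in range(len(tiles) - 1):
        left, right = tiles[i], tiles[i + 1]
        rule = RULES.get((left[2], right[2]))
        if rule is None:
            continue                      # not an "interesting" edge
        distance, allowed = rule
        edge_x = right[0]
        limit = edge_x + distance
        for x0, _x1, kind in tiles[i + 1:]:
            if x0 >= limit:
                break                     # beyond the constraint area
            if kind not in allowed:
                errors.append(
                    (edge_x, f"{kind} within {distance} of {left[2]}|{right[2]} edge")
                )
                break
    return errors


narrow = [(0, 5, "space"), (5, 6, "poly"), (6, 10, "space")]
legal = [(0, 5, "space"), (5, 8, "poly"), (8, 12, "space")]
narrow_errors = check_strip(narrow)   # the poly tile is only 1 unit wide
legal_errors = check_strip(legal)
```

Only edges and the small constraint area beside each edge are examined, which is why the cost depends on the amount of changed material and not on the size of the whole layout.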

In addition to a fast basic checker, the incremental rule checker contains
algorithms  for  handling  hierarchy.  When  a  cell  in  the  middle  of  a  hierarchical 
layout  is  changed,  Magic  checks  interactions  between  this  cell  and  its  subcells, 
and  also  interactions  between  this  cell  and  other  cells  in  its  parents  and 
grandparents.  More  details  on  the  basic  DRC  mechanism  and  on  Magic’s 
hierarchical approach can be found in [12].

7.  Plowing 

Plowing  is  a  simple  operation  that  can  be  used  to  rearrange  a  layout 
without  changing  the  electrical  circuit  that  it  represents.  To  invoke  the  plow 
operation,  the  user  specifies  a  vertical  or  horizontal  line  segment  (the  plow) 
and a distance perpendicular to it (the plow distance). See Figure 5. Magic
sweeps the plow for the specified distance, and moves all material
out of the area swept out by the plow. The edges of this material are likewise
treated as plows, pushing other material in front of them. Mask geometry in
front of the plow is compacted as it is moved, and mask geometry that crosses
the initial position of the plow is stretched behind the plow. Jogs are inserted
at the ends of the plow. The plow operation maintains design rules and connectivity
so that it doesn’t change the electrical structure of the circuit. Most
material, such as polysilicon, diffusion, and metal, may be stretched or compacted
by plowing; transistors and contacts may be moved, but their shape
will not change.

Figure 5. In plowing, a horizontal or vertical line is moved across the circuit, pushing
material out of its way. Design rules and connectivity are maintained.

Magic: A VLSI Layout System
December 2, 1983

Plowing  provides  all  the  operations  of  a  sticks-based  system,  while  still 
working  with  fully-fleshed  geometry.  If  a  large  plow  is  placed  to  one  side  of  a 
cell  and  then  moved  across  the  cell,  the  cell  will  be  compacted.  If  a  large  plow 
is  placed  across  the  middle  of  the  cell  and  moved,  the  cell  will  be  stretched  at 
that  point.  A  small  plow  placed  in  the  middle  of  a  cell  can  be  used  to  open  up 
empty  space  for  new  transistors  or  wiring.  Plowing  may  be  used  both  on  low- 
level  cells  containing  only  geometry,  and  on  high-level  cells  containing  subcells 
and routing. Plowing moves each subcell as a unit, without affecting the contents
of the subcell.
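
The pushing behavior of a plow can be illustrated in one dimension. The sketch below is a deliberate simplification and not the algorithm of [11]: it handles only rigid shapes (like transistors) along a single axis, assumes no shape straddles the plow, and uses one spacing value for every pair of shapes. The function name `plow` and its parameters are hypothetical.

```python
def plow(shapes, plow_x, distance, spacing):
    """Push rigid 1-D shapes to the right with a plow.

    shapes   -- non-overlapping (x0, x1) intervals sorted left to right
    plow_x   -- starting position of the plow
    distance -- how far the plow moves to the right
    spacing  -- minimum gap that must remain between neighboring shapes
    """
    front = plow_x + distance          # nothing may start left of this line
    moved = []
    for x0, x1 in shapes:
        if x0 >= plow_x:               # shapes behind the plow are untouched
            shift = max(0, front - x0)
            x0, x1 = x0 + shift, x1 + shift
            # The right edge of this shape acts as a plow for the next one.
            front = x1 + spacing
        moved.append((x0, x1))
    return moved


before = [(0, 2), (4, 6), (10, 12), (20, 22)]
after = plow(before, plow_x=3, distance=4, spacing=1)
```

In this example the shape at 4..6 is pushed to 7..9, the gap in front of it shrinks from 4 units to the minimum spacing of 1, and the shapes at 0..2 and 20..22 do not move.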

The  implementation  of  plowing  is  dependent  on  corner  stitching,  abstract 
layers,  and  the  edge-based  design  rules.  Corner  stitching  provides  the  fast 
geometric operations used to search out plow areas. The abstract layers tell
Magic about materials that cannot be stretched or compacted (such as transistors).
The edge-based design rules indicate what must be moved out of the
way  when  a  particular  edge  of  material  is  moved.  By  working  from  the  same 
data  structure  used  for  editing  and  design-rule  checking,  the  plowing  operation 
avoids  the  overhead  of  converting  between  representations.  See  [11]  for  a 
detailed  presentation  of  the  plowing  operation  and  its  implementation. 

8.  Circuit  Extraction  and  Cell  Overlaps 

The  Magic  database  makes  circuit  extraction  almost  trivial  for  individual 
cells.  Because  of  the  abstract  layers  and  corner  stitching,  the  circuit  is  almost 
completely  extracted  to  begin  with.  All  that  is  needed  is  to  traverse  the  tile 
structure  and  record  information  about  what  connects  to  what.  There  is  no 
need to register layers or infer the structure and type of transistors and contacts:
all this information is represented explicitly.
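
A rough sketch of such a traversal for one plane is given below. It is illustrative only: tiles are plain tuples, neighbors are found by comparing all pairs of tiles instead of following corner stitches, and the type names and the `extract` function are assumptions made for the example.

```python
from itertools import combinations

WIRES = {"polysilicon", "diffusion"}   # tile types that carry a signal


def touch(a, b):
    """True if tiles a and b share an edge segment of nonzero length."""
    _, ax0, ay0, ax1, ay1 = a
    _, bx0, by0, bx1, by1 = b
    overlap_x = min(ax1, bx1) - max(ax0, bx0)
    overlap_y = min(ay1, by1) - max(ay0, by0)
    return (overlap_x == 0 and overlap_y > 0) or (overlap_y == 0 and overlap_x > 0)


def extract(tiles):
    """Return (net id per tile, list of transistors with their attached nets)."""
    parent = list(range(len(tiles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Touching wire tiles of the same type belong to the same net.
    for i, j in combinations(range(len(tiles)), 2):
        a, b = tiles[i], tiles[j]
        if a[0] == b[0] and a[0] in WIRES and touch(a, b):
            parent[find(i)] = find(j)

    nets = [find(i) if tiles[i][0] in WIRES else None for i in range(len(tiles))]

    # Transistors are explicit tiles, so their terminals are just their neighbors.
    transistors = []
    for t in tiles:
        if t[0] != "transistor":
            continue
        gate = {nets[j] for j, u in enumerate(tiles)
                if u[0] == "polysilicon" and touch(t, u)}
        terminals = {nets[j] for j, u in enumerate(tiles)
                     if u[0] == "diffusion" and touch(t, u)}
        transistors.append({"gate": gate, "terminals": terminals})
    return nets, transistors


layout = [
    ("polysilicon", 0, 4, 4, 6),   # gate wire, left of the transistor
    ("transistor", 4, 4, 6, 6),
    ("diffusion", 4, 6, 6, 10),    # above the transistor
    ("diffusion", 4, 0, 6, 4),     # below the transistor
    ("diffusion", 6, 8, 9, 10),    # extends the upper diffusion to the right
]
nets, transistors = extract(layout)
```

The transistor is never inferred from overlapping polygons; it is found simply because a tile of type "transistor" exists, which is the point made in the paragraph above.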

For  hierarchical  designs,  the  situation  is  complicated  when  cells  overlap. 
Each  cell  uses  a  separate  set  of  corner-stitched  planes,  so  information  from  the 
separate  planes  must  be  combined  in  order  to  find  out  what  connects  to  what. 
If  arbitrary  overlaps  are  allowed,  then  transistors  may  be  split  between  cells, 
or  may  be  formed  or  broken  by  cell  overlaps.  In  this  case,  circuits  cannot  be 
extracted  hierarchically,  since  the  structure  of  a  cell  may  be  changed  by  the 
way it is used in its parents.

One  approach  to  the  overlap  problem  is  to  prohibit  cell  overlaps.  This 
has  two  drawbacks,  however.  First,  it  makes  for  clumsy  designs,  since  overlap 
areas  must  be  placed  in  separate  cells.  This  makes  it  harder  to  understand 
designs  and  harder  to  re-use  cells.  Second,  it  doesn’t  eliminate  the  problems  in 
circuit  extraction,  since  information  will  still  have  to  be  registered  along  the 
boundaries  of  abutting  cells.  For  example,  a  cell  abutment  can  cause  two 
separate  transistors  to  join  together. 

Instead  of  prohibiting  overlaps,  we  decided  to  restrict  them.  In  Magic, 
cells  may  abut  or  overlap  as  long  as  this  only  connects  portions  of  the  cells 
without changing their transistor structure. Overlaps and abutments may not
change  the  type  or  number  of  transistors  from  what  it  would  be  without  the 
overlap  (e.g.  polysilicon  from  one  cell  may  not  overlap  diffusion  from  another 
cell,  since  this  would  create  a  new  transistor).  These  restrictions  can  be 
verified  by  using  a  special  set  of  design  rules  in  the  part  of  the  design-rule 
checker  that  deals  with  cell  overlaps. 

Our  solution  still  requires  information  to  be  registered  between  subcells, 
but it allows the extracted circuit to be represented (and extracted) hierarchically.
The extracted circuit for any cell consists of the circuits of its subcells,
plus the circuit of the cell itself, plus a few connections between the subcells.

9. Routing

Routing  is  the  single  most  important  area  where  we  hope  Magic  will  speed 
up  the  design  process.  Most  of  the  Magic  routing  effort  has  been  spent  in  two 
areas:  a)  creating  a  channel  router  that  can  work  around  obstacles  in  the 
channel  (such  as  previously-placed  interconnections  and  power  and  ground 
routing);  and  b)  developing  an  interface  between  grid-based  routers  and  non- 
gridded  custom  designs. 

Magic  uses  a  standard  three-phase  approach  to  routing.  In  the  first 
phase,  called  channel  decomposition,  the  empty  space  of  the  layout  is  divided 
up  into  rectangular  channels.  In  the  second  phase,  called  global  routing,  nets 
are  processed  sequentially  to  decide  which  channels  will  be  crossed  by  each.  In 
the  third  phase,  called  channel  routing,  each  channel  is  considered  separately 
and  wires  are  placed  to  achieve  the  necessary  connections  within  the  channel. 
Magic’s  channel  decomposer  (which  is  not  yet  implemented)  will  be  based  on 
the bottleneck approach of the BBL system [1]. Global routing (also not yet
implemented) will use a standard wavefront approach [4]. Both of these will
use corner-stitching to keep track of the channel space. The channel router
has been implemented, and is an extended version of Rivest’s greedy router [9].
Magic  does  not  provide  placement  tools:  in  our  design  style,  placement  is  an 
important  architectural  decision  and  must  be  handled  by  designers. 
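
The global-routing phase can be pictured as a wavefront expanding over the graph of adjacent channels. The text says only that a standard wavefront approach [4] is planned, so the sketch below is a generic breadth-first search and not Magic's router; the channel names and the `global_route` function are hypothetical.

```python
from collections import deque


def global_route(adjacent, source, target):
    """Return the shortest sequence of channels from source to target.

    adjacent -- dict mapping each channel to the channels it borders
    Returns None if the target cannot be reached.
    """
    previous = {source: None}          # also serves as the "visited" set
    frontier = deque([source])
    while frontier:
        channel = frontier.popleft()
        if channel == target:
            path = []
            while channel is not None:  # walk back along the wavefront
                path.append(channel)
                channel = previous[channel]
            return path[::-1]
        for neighbor in adjacent[channel]:
            if neighbor not in previous:
                previous[neighbor] = channel
                frontier.append(neighbor)
    return None


channels = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "E", "F"],
    "D": ["A", "E"],
    "E": ["D", "C"],
    "F": ["C"],
    "G": [],                           # an isolated channel
}
route = global_route(channels, "A", "F")
```

Nets would be processed one after another in this way, each result telling the channel router which channels must carry the net.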

Figure 6. An example of routing with a single-layer obstacle in the channel. The
router tries to avoid the thickest part of the obstacle if possible.


In order to make the routing tools usable in a custom design environment,
we have developed a channel router that can work around obstacles in
the channels. It is important for designers to be able to wire critical nets by
hand, and to have the automatic routing tools route the less critical nets
without affecting the hand-routed ones. It is also convenient to run power and
ground routing tools as a separate step before signal routing, and have the signal
router work around the power and ground wires. Where there are obstacles
in the channels, Magic will route under them if possible, and will route
around those that block both routing layers. For very large obstacles in one
layer, such as a wide metal ground bus, Magic can make interconnections
under the obstacles using river-routing. See [2] for details on how Rivest’s
greedy router has been extended to handle obstacles. Figure 6 shows an example
of results produced by the Magic router.

The evasive router, combined with plowing and the other editing features,
provides designers with considerable flexibility. Critical signals and power and
ground can be routed by hand. Then the router can be invoked to complete
the rest of the interconnections. If the router is unable to make all connections,
the final ones can be placed by hand. Or, plowing can be used to rearrange
the placement and the router can be re-run. The plowing operation
will  maintain  the  existing  connections. 

We  have  also  extended  the  standard  routing  approach  to  handle  designs 
that  are  not  based  on  a  uniform  routing  grid.  Most  channel  routers  assume  a 
uniform  grid  based  on  the  minimum  wire  spacing:  channel  dimensions  must  be 
an  integral  number  of  grid  units,  and  all  wires  must  enter  and  leave  channels 
on  grid  points.  Unfortunately,  custom  cells  are  not  usually  designed  with  the 
router’s grid in mind, so the cell boundaries and terminals do not line up on a
master grid. We are experimenting with two approaches to this problem,
called sidewalks and flexible grid.

Figure 7. In the sidewalk approach, each cell is enlarged so that its boundary is
grid-aligned. Then connections on the edge of the original cell are routed to grid
points on the outside of the sidewalk.

The sidewalk approach is illustrated in Figure 7, and involves a pre-routing
step where all cells are expanded so that their dimensions are integral
grid units. This additional cell area is called its sidewalk. In addition, wires
are added to connect the terminals of the cell to grid points on the outer edge
of the sidewalk. After the sidewalk generation stage, everything is grid-aligned
so standard routing tools can be used. Magic currently implements the
sidewalk approach. Sidewalks are inefficient because the sidewalk areas cannot
be used for channel routing, even though they usually contain little
material. Sidewalks typically cause the channels to be reduced in size by 2-3
tracks and 2-3 columns.
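
The arithmetic behind sidewalk generation is simple enough to show directly. The following sketch assumes integer coordinates and a square routing grid anchored at the origin; the function names `sidewalk_box` and `snap_terminal` are hypothetical, and conflicts between terminals that snap to the same grid point are ignored.

```python
def sidewalk_box(cell, grid):
    """Expand a cell's bounding box outward to the nearest grid lines.

    cell -- (x0, y0, x1, y1) with x0 < x1 and y0 < y1
    grid -- routing grid pitch (positive integer)
    """
    x0, y0, x1, y1 = cell
    return (
        x0 // grid * grid,        # round the lower-left corner down
        y0 // grid * grid,
        -(-x1 // grid) * grid,    # round the upper-right corner up
        -(-y1 // grid) * grid,
    )


def snap_terminal(coordinate, grid):
    """Nearest grid position for a terminal along a sidewalk edge."""
    return (coordinate + grid // 2) // grid * grid


cell = (3, 5, 22, 18)
expanded = sidewalk_box(cell, grid=7)
terminal_y = snap_terminal(11, grid=7)
```

Here a 19-by-13 cell grows to 28 by 21 on a pitch of 7, so up to one grid unit is added on each side, and the short wire from the terminal at y = 11 to the grid point at y = 14 lies inside the sidewalk.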

Figure 8. Rather than expand cells to grid points as in the sidewalk approach, the
flexible grid approach modifies the track and column structure of the channel. The
channel is grid-based in the center, but the grid lines jog at the edges to meet up
with non-gridded connections. (a) shows the standard orthogonal channel structure,
and (b) shows a channel whose grid structure has been flexed. The flexible grid approach
can result in tracks or columns that don’t extend all the way across the channel.

The flexible grid approach distributes the sidewalks among the channels
by jogging the track and column structure at the ends to match up with connection
points that don’t fall on grid lines. This is illustrated in Figure 8. In
the flexible grid approach, wasted space occurs within the channel because
some tracks and columns cannot extend all the way across the channel.
However,  there  appears  to  be  less  wasted  space  in  this  approach  than  in  the 
sidewalk  approach.  In  the  worst  case,  the  wasted  space  is  equivalent  to  two 
tracks  and  two  columns  per  channel.  If  connection  points  are  sparse,  however 
(and  this  is  usually  the  case),  the  flexible  grid  approach  has  almost  zero  wasted 
space.  We  are  still  in  the  early  stages  of  exploring  this  alternative. 

10.  User  Interface 

Magic  displays  the  layout  on  a  color  display,  and  users  invoke  commands 
by  pointing  on  the  display  with  a  mouse  and  then  pushing  mouse  buttons  or 
typing  keyboard  commands.  Magic  provides  multiple  overlapping  windows  on 
the  color  display.  Each  window  is  a  separate  rectangular  view  on  a  layout. 
Different  windows  may  refer  to  different  portions  of  a  single  cell,  or  to  totally 
different  cells.  Windows  allow  designers  to  see  an  overall  view  of  the  chip 
while zooming in on one or more pieces of the chip; this permits precise alignments
of large objects. Information can be copied from one window to
another.

11. Technology Independence

Although  Magic  contains  a  considerable  amount  of  knowledge  about 
integrated  circuits,  the  information  is  not  embedded  directly  in  code.  All  the 
circuit  information  is  contained  in  a  technology  file  that  Magic  reads.  This  file 
defines  the  abstract  layers  for  a  particular  technology,  the  corner-stitched 
planes  used  to  represent  them,  and  the  assignment  of  abstract  layers  to  planes. 
It  tells  how  to  display  the  various  layers  and  defines  the  semantics  of  the  paint 
and  erase  operations  from  Section  5  (for  example  “if  poly-metal-contact  is 
painted  over  diffusion,  erase  the  diffusion  and  place  poly-metal-contact  tiles  on 
both  the  poly-diff  and  metal  planes”).  The  technology  file  contains  the  design 
rules  used  in  design-rule  checking  and  in  plowing.  Lastly,  it  tells  how  to  fill  in 
the structural details of transistors and contacts when generating CIF for circuit
fabrication. The technology file format is general enough to handle a
variety of nMOS and CMOS processes. Our technology file for an nMOS process
with buried contacts and single-level metal contains about 130 lines.
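
The kind of information a technology file carries can be pictured with a small table-driven sketch. The text does not show the real file format, so the Python dictionaries below are purely an illustration: the plane contents follow Table I, the contact rule follows the example quoted above, and the rule that polysilicon painted over diffusion yields an enhancement transistor is an assumption added for the example.

```python
# Assignment of abstract layers to planes (after Table I).
PLANES = {
    "poly-diff": {
        "polysilicon", "diffusion", "enhancement-transistor",
        "depletion-transistor", "buried-contact",
        "poly-metal-contact", "diffusion-metal-contact", "space",
    },
    "metal": {
        "metal", "poly-metal-contact", "diffusion-metal-contact",
        "overglass-via", "space",
    },
}

# (plane, painted type, existing type) -> resulting type.
PAINT_TABLE = {
    # From the text: the contact replaces diffusion on the poly-diff plane.
    ("poly-diff", "poly-metal-contact", "diffusion"): "poly-metal-contact",
    # Assumed for illustration: poly crossing diffusion forms a transistor.
    ("poly-diff", "polysilicon", "diffusion"): "enhancement-transistor",
}


def paint(point, painted):
    """Return the tile types at one point after painting `painted` over it.

    point -- dict mapping plane name to the tile type currently present
    """
    result = dict(point)
    for plane, types in PLANES.items():
        if painted not in types:
            continue                   # this layer does not live on this plane
        key = (plane, painted, point[plane])
        result[plane] = PAINT_TABLE.get(key, painted)
    return result


start = {"poly-diff": "diffusion", "metal": "space"}
with_contact = paint(start, "poly-metal-contact")
with_poly = paint(start, "polysilicon")
```

Keeping these tables outside the code is what lets the same editor handle a different process by reading a different file.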

12.  Implementation 

The  implementation  of  Magic  was  begun  in  February  of  1983.  By  early 
April  1983,  a  primitive  version  of  the  system  was  operational.  Although  the 
first  system  was  based  on  corner  stitching  and  abstract  layers,  it  provided  user 
features only equivalent to Caesar. During the summer of 1983 implementation
was begun on the subsystems for routing, multiple windows, plowing, and
design-rule checking. As of this writing, most of the advanced features are
either operational or expected to be operational in the near future. See Table
II. The system has been in use since April 1983 by the designers of a 32-bit
microprocessor [8], and since September 1983 by several dozen students in an
introductory VLSI design class.

Subsystem                                 Implementation Status
--------------------------------------    -------------------------------------
Edge-based DRC                            Operational 9/1/83
Hierarchical and Continuous DRC           Operational 11/1/83
Circuit Extraction                        Not begun
Plowing                                   Simplified version operational 10/1/83
                                          Full version expected 1/1/84
Net List Editing                          Operational 5/1/83
Channel Decomposition                     Expected 1/1/84
Global Router                             Expected 2/1/84
Channel Router with Obstacle Avoidance    Operational 10/1/83
Multiple Windows                          Operational 11/1/83

Table II. The implementation status of Magic.


Operation                                       Speed
--------------------------------------------    --------------
Painting tiles into corner-stitched database    200 tiles/sec.
Design-rule checking                            200 tiles/sec.
Simplified Plowing                              100 tiles/sec.
Channel routing (“Deutsch’s difficult           3 sec.
example,” 60 nets)

Table III. Some sample measurements of the speed of the Magic system. All measurements
were made on a VAX-11/780.

Magic is written in C under the Berkeley 4.2 Unix operating system for
VAX processors. The current implementation works only with AED color
displays with special Berkeley microcode extensions. Altogether, Magic contains
approximately 45000 lines of code. Table III gives a few sample performance
measurements of pieces of the system.


13.  Conclusions 


We have not yet had enough designer experience with Magic to evaluate
the system thoroughly, but the initial response has been favorable. The only
major problem encountered so far has been one of education: if designers are
accustomed to working with actual mask layers, then the abstract layers in
Magic are confusing at first. This problem was exacerbated in the early
versions of the system because the design-rule checker wasn’t implemented. With
continuous feedback from the checker, we hope that it will be much easier for
designers to learn the abstract layers. We expect that the abstract layers will
be easier for designers to work with than the actual mask layers, since they
hide many irrelevant details.


The pieces of the Magic system work well together. Corner stitching
appears to be a complete success: it provides all the operations needed to
implement Magic’s advanced features, and results in simple and fast
algorithms. The design-rule checker’s edge-based rule set meshes well with the
corner-stitched data, and is used also for plowing. The abstract layers simplify
the design rules, provide information needed for plowing and circuit
extraction, and simplify the designer’s view of the layout.

We hope that Magic’s flexibility will change the VLSI layout process in
two ways. First, we hope that it will enable designers to experiment much
more than previously. At the cell level, they can use plowing to rearrange
cells quickly and easily. Cells can be designed loosely, then compacted. At
the chip level, plowing and the routing tools can be used together to
rearrange the floorplan, route the connections, compact or stretch, and try
again. The ability to experiment means that students will be able to develop
better intuitions about how to design chips; it also means that designers will
be able to fix bugs and enhance performance more easily.

Second, we hope that Magic will make it easier to reuse pieces of designs.
To design a new chip, a designer will select cells from a large library, use
plowing and painting to make slight modifications in their shape or function to
suit the new application, and perhaps design a few new cells. Then the
routing tools will be used to interconnect the cells. We hope that this approach
will result in a substantial reduction in design time for large circuits.

ABSTRACT

The Magic VLSI layout editor contains an incremental design-rule checker.
When the circuit is changed, only the modified areas are rechecked. The
checker runs continuously in background to keep information about
design-rule violations up-to-date. This paper describes the basic rule checker, which
operates on edges in the layout, and the techniques used to perform
incremental checking on hierarchical designs.

Keywords and Phrases: design-rule checking, interactive layout editor

1. Introduction

Almost all existing design-rule checking (DRC) programs are batch
oriented [1] [2]. They read in a complete circuit layout and check the entire
design. If the circuit is changed, the only way to find out whether design rules
have been violated is to recheck the entire design, no matter how small the
change or how large the design. For chips with tens of thousands of
transistors, a batch DRC run may require hours of computer time.

This  paper  describes  a  different  approach  to  design-rule  checking.  As 
part  of  the  Magic  VLSI  layout  editor  [3],  we  have  built  a  checker  that  operates 
incrementally.  When  the  layout  is  modified,  Magic  records  which  areas  have 
changed  and  rechecks  only  those  areas.  While  the  user  continues  editing,  the 
checker  runs  in  background  and  highlights  errors  as  it  finds  them.  There  is  no 
set-up  time  because  it  works  from  the  same  data  structure  used  to  represent 
the  layout.  Since  most  changes  made  with  the  interactive  editor  are  small  and 
the  checker  is  fast,  it  can  usually  display  errors  instantly. 

The user’s view of design-rule checking is a simple one. As he edits the
circuit, small white dots appear over areas that contain layout errors. As soon
as the errors are fixed, the white dots go away. Error information is stored
with the design and it will reappear during the next editing session if the
violation has not been fixed. This information is always kept up-to-date, so there is
never any need to run a batch checker.

In the next section, we describe Magic’s internal representation for a
layout and explain how particular features contribute to fast incremental
checking. Section 3 describes how the basic checker works from edges in the layout
and how design rules are specified. Section 4 shows how we use the basic
checker for incremental checking of individual cells, and Section 5 describes
how hierarchical designs are handled. Section 6 gives measurements of the
checker’s speed.

2.  Representation  of  a  Layout 

In  Magic,  a  layout  is  represented  as  a  hierarchical  collection  of  cells. 
Each  cell  contains  mask  information  plus  pointers  to  subcells.  For  now,  we 
will  consider  only  a  single  cell  at  a  time  (Section  5  generalizes  the  solution  to 
handle  hierarchical  designs). 

Magic  represents  the  mask  layers  of  a  cell  with  rectangular  tiles,  which 
means  that  it  handles  only  Manhattan  geometries.  Each  tile  indicates  the  type 
of  mask  layer  it  represents.  Tiles  are  connected  to  form  planes  by  a  technique 
called corner-stitching [2], illustrated in Figure 1. The tiles in a plane are
non-overlapping  and  cover  it  completely.  Empty  areas  are  covered  with  tiles 
of  type  “space.” 

Each cell contains several planes of mask information. Mask types that
interact (such as polysilicon and diffusion) are stored together in the same
plane, while those that do not interact (such as polysilicon and metal) are
stored in different planes. Contacts between mask types on different planes are
represented in both of them. Our nMOS process has two planes: one for
metal and one for polysilicon, diffusion, and transistors.

Figure 1. An example of a corner-stitched plane. Each plane contains tiles of
different types that cover the entire area of the plane (space tiles are used where
there is no mask material). Each tile contains four pointers that link it to
neighboring tiles at its corners. The pointers make it easy to find all the material in a given
area.
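The way the four corner pointers support area searches can be sketched in a few lines of Python. This is an illustrative model, not Magic's C code: the stitch names (`bl`, `lb`, `tr`, `rt`), the `Tile` class, and the five-tile example plane are all assumptions made for the sketch, and the point being searched for is assumed to lie inside the plane.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Tile:
    """A tile covering x0 <= x < x1 and y0 <= y < y1 (hypothetical model)."""
    kind: str
    x0: int
    y0: int
    x1: int
    y1: int
    # Two stitches at the lower-left corner, two at the upper-right corner.
    # None marks the boundary of this small finite plane.
    bl: Optional["Tile"] = field(default=None, repr=False)  # left neighbor, at bottom
    lb: Optional["Tile"] = field(default=None, repr=False)  # lower neighbor, at left
    tr: Optional["Tile"] = field(default=None, repr=False)  # right neighbor, at top
    rt: Optional["Tile"] = field(default=None, repr=False)  # upper neighbor, at right


def find_tile(start: Tile, x: int, y: int) -> Tile:
    """Follow stitches from `start` to the tile containing (x, y)."""
    t = start
    while True:
        while y < t.y0:          # move down until the y range fits
            t = t.lb
        while y >= t.y1:         # move up until the y range fits
            t = t.rt
        if x < t.x0:             # move left; may misalign y, so loop again
            while x < t.x0:
                t = t.bl
        elif x >= t.x1:          # move right; may misalign y, so loop again
            while x >= t.x1:
                t = t.tr
        else:
            return t


def build_example() -> dict:
    """A 10x10 plane: one polysilicon tile surrounded by space tiles."""
    s1 = Tile("space", 0, 0, 10, 3)
    left = Tile("space", 0, 3, 4, 7)
    poly = Tile("polysilicon", 4, 3, 6, 7)
    right = Tile("space", 6, 3, 10, 7)
    s2 = Tile("space", 0, 7, 10, 10)
    s1.rt = right
    left.lb, left.tr, left.rt = s1, poly, s2
    poly.bl, poly.lb, poly.tr, poly.rt = left, s1, right, s2
    right.bl, right.lb, right.rt = poly, s1, s2
    s2.lb = left
    return {"s1": s1, "left": left, "poly": poly, "right": right, "s2": s2}
```

Starting from any tile, `find_tile` reaches the target by touching only tiles along the path, which is the property that makes local searches around a modified area cheap.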

Instead  of  working  directly  with  physical  mask  layers,  Magic  uses  abstract 
layers  to  represent  structures  such  as  transistors  and  contacts.  The  abstract 
layers  appear  in  the  database  as  tiles  with  special  types.  For  example,  instead 
of  representing  an  enhancement  transistor  as  a  polysilicon  tile  over  a  diffusion 
tile,  it  is  represented  with  a  tile  of  type  “enhancement  transistor.”  A  more 
complete  explanation  of  the  abstract  layers  is  given  in  [3].  What  matters  here 
is  that  all  the  interesting  features  are  represented  explicitly:  there  is  no  need 
to cross-register diffusion and polysilicon to discover the transistors.

The design-rule checker takes advantage of Magic’s database in three
ways. First, the corner-stitched tiles allow DRC to find material in a given
area very quickly. Second, division of mask information into planes allows the
checker to work with one plane at a time, ignoring irrelevant geometry on
other planes. Third, there is no need to extract features by registering layers:
the abstract layers represent the important features explicitly. Because of
these features, there is no need for the checker to manage a separate structure
of its own: it works directly from the layout database.

3. The Basic Checker

This  section  describes  the  basic  design-rule  checking  paradigm  used  to 
validate  an  area  of  a  single  corner-stitched  plane.  Later  sections  show  how 
this  basic  checker  is  used  to  perform  incremental  checks  on  a  single  cell,  and 
then  on  a  hierarchy  of  cells. 

3.1. Edge-based Rules

Magic’s design rules are based on edges between tiles. Each rule can be
applied in any of four directions, two for horizontal edges and two for vertical
edges. The rule database contains a separate list of rules for each possible
combination of materials on the first and second sides of an edge. In its
simplest form, a rule specifies a distance and a set of mask types: only the given
types are permitted within that distance on the second side of the edge. This
is illustrated in Figure 2.

[Figure 2 diagram: an edge between type 1 and type 2 in each of the four
orientations. Only certain tile types are allowed in the dashed constraint regions.]

Figure 2. Design rules are applied at the edges between tiles in the same plane. A
rule is specified in terms of type 1 and type 2, the materials on either side of the edge.
Each rule may be applied in any of four directions, as shown by the arrows. The
simplest rules require that only certain mask types can appear within distance d on
type 2’s side of the edge.

Unfortunately, this simple scheme will miss errors in corner regions, as
shown in Figure 3. To eliminate these problems, the full rule format allows
the constraint region to be extended past the ends of the edge under some
circumstances. See Figure 4 for an illustration of the corner rules and how they
work. Table 1 gives a complete summary of the information in each design
rule.


[Figure 3 diagram: panels (a) and (b), labeled with the tile types allowed and
the constraint regions.]

Figure 3. If only the simple rules from Figure 2 are used, errors may go unnoticed
in corner regions. For example, the polysilicon spacing rule in (a) will fail to detect
the error in (b).

Figure 4. The complete design rule format is illustrated in (a). Whenever an edge
has type 1 on its left side and type 2 on its right side, the area A is checked to be
sure that only types allowed are present. If the material just above and to the left of
the edge is one of corner types, then area B is also checked to be sure that it
contains only types allowed. A similar corner check is made at the bottom of the edge.
Figure (b) shows one of the polysilicon spacing rules, (c) shows a situation where
corner extension is performed on both ends of the edge, and (d) shows a situation
where corner extension is made only at the bottom of the edge.


Parameter         Meaning

type 1            Material on first side of edge.
type 2            Material on second side of edge.
d                 Distance to check on second side of edge.
layers allowed    List of layers that are permitted within d units on
                  second side of edge.
corner types      List of layers that cause corner extension.
ce                Amount to extend constraint area when corner types match.

Table 1. The parts of an edge-based rule.

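The fields of Table 1 and the corner extension of Figure 4 can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the layout is modeled as a grid of unit squares rather than corner-stitched tiles, only a vertical edge applied left-to-right is handled, and the names `EdgeRule`, `cell`, and `check_vertical_edge` are hypothetical.

```python
from dataclasses import dataclass

SPACE, POLY = "s", "p"


@dataclass(frozen=True)
class EdgeRule:
    type1: str               # material on first side of edge
    type2: str               # material on second side of edge
    d: int                   # distance to check on second side
    allowed: frozenset       # layers permitted within d units
    corner_types: frozenset  # layers that cause corner extension
    ce: int                  # amount to extend when corner types match


def cell(grid, x, y):
    """Material of unit square (x, y); anything off the grid is space."""
    if 0 <= y < len(grid) and 0 <= x < len(grid[0]):
        return grid[y][x]
    return SPACE


def check_vertical_edge(grid, rule, x, y0, y1, use_corners=True):
    """Apply `rule` left-to-right to the vertical edge at `x` spanning rows
    y0..y1-1. The caller is assumed to have picked a rule whose type1/type2
    match the edge. Returns the unit squares that violate the rule."""
    lo, hi = y0, y1
    if use_corners:
        if cell(grid, x - 1, y1) in rule.corner_types:      # above and to the left
            hi = y1 + rule.ce
        if cell(grid, x - 1, y0 - 1) in rule.corner_types:  # below and to the left
            lo = y0 - rule.ce
    return [(cx, cy)
            for cy in range(lo, hi)
            for cx in range(x, x + rule.d)
            if cell(grid, cx, cy) not in rule.allowed]


# Row 0 is the bottom row. A 2x2 polysilicon block sits at the lower left and
# a second polysilicon square lies diagonally one unit away at (3, 3).
GRID = ["ppss",
        "ppss",
        "ssss",
        "sssp"]

POLY_SPACING = EdgeRule(POLY, SPACE, 2, frozenset({SPACE}), frozenset({SPACE}), 2)
```

With `use_corners=False` the rule applied to the right edge of the block (x=2, rows 0 to 1) sees only space, so the diagonal neighbor goes unnoticed, which is the situation Figure 3 describes. With corner extension the constraint area grows upward and the square at (3, 3) is reported.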
3.2. Applying the Rules

To  check  an  area  of  a  single  plane,  Magic  must  first  find  all  the  edges  in 
that  area.  This  is  accomplished  by  searching  for  all  the  tiles  in  the  area.  The 
corner-stitched  data  structure  is  well  suited  to  searches  of  this  sort:  see  [4]. 
For  each  tile,  the  checker  examines  its  left  and  bottom  sides  (the  top  and  right 
sides  of  the  tile  will  be  checked  by  the  neighbors  on  those  sides).  Since  the  tile 
may  have  neighbors  of  different  types  on  the  same  side,  the  checker  searches 
through  all  the  neighbors  to  divide  the  side  of  the  tile  into  edges  with  a  single 
material  on  each  side. 

To process an edge, the mask types on each side of it are used to index
into the rule table to find the list of rules for that kind of edge. Each rule in
the list is checked, and white dots are displayed for any areas where the
constraints are not satisfied. For each edge there are two rule applications:
left-to-right and right-to-left (for vertical edges) or bottom-to-top and
top-to-bottom (for horizontal edges). A different list of rules is applied in each
direction, since the layers are reversed.
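The two steps described in this section, dividing a tile side into single-material edges and then indexing the rule table in both directions, might look as follows in Python. The tuple layouts, the function names, and the placeholder rule names are assumptions for illustration only.

```python
def split_side(tile_kind, lo, hi, neighbors):
    """Divide one side of a tile (spanning lo..hi along the side) into edges
    that have a single material on each side.

    `neighbors` lists (n_lo, n_hi, n_kind) for every tile touching that side.
    Returns (e_lo, e_hi, neighbor_kind, tile_kind) tuples; adjacent neighbors
    of the same kind are merged into one edge."""
    edges = []
    for n_lo, n_hi, n_kind in sorted(neighbors):
        e_lo, e_hi = max(lo, n_lo), min(hi, n_hi)
        if e_lo >= e_hi:
            continue  # neighbor does not actually touch this side
        if edges and edges[-1][2] == n_kind and edges[-1][1] == e_lo:
            edges[-1] = (edges[-1][0], e_hi, n_kind, tile_kind)
        else:
            edges.append((e_lo, e_hi, n_kind, tile_kind))
    return edges


def rule_applications(rule_table, edge):
    """Look up the rule lists for both directions across an edge.

    The table is keyed by (first-side material, second-side material), so the
    reverse direction simply swaps the key."""
    _, _, near, far = edge
    return {
        "forward": rule_table.get((near, far), []),
        "reverse": rule_table.get((far, near), []),
    }


# Placeholder rule names, keyed by the materials on the two sides of an edge.
EXAMPLE_TABLE = {("s", "p"): ["poly-width"], ("p", "s"): ["poly-spacing"]}
```

For a polysilicon tile whose left side touches space, diffusion, and space again, `split_side` yields three edges, and only the space/polysilicon edges find rules in this tiny example table.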

3.3.  Specifying  Design  Rules 

Design rules are specified in a technology file that contains the rules and
other technology-specific information. When Magic starts executing, it reads
this file and builds the rule table. Initially we specified rules in the detailed
form of Table 1, with one line for each edge rule. This scheme proved to be
unworkable, because there were many rules and it became difficult to convince
ourselves that the rule set was complete and correct.

In order to simplify the process of creating rule sets, Magic now permits
rules to be specified with high-level macros for width and spacing. For
example, the macro

    spacing ef DP 1

is expanded into several rules to verify that types e and f (enhancement and
depletion transistors) are always separated from types D and P (diffusion-metal
contacts and poly-metal contacts) by at least one unit. The macro

    width pPBef 2

is expanded into the set of edge rules needed to verify that the entire region
containing any of the five types P, B, e, f or p (polysilicon) is always at least
two units wide.

Most  of  the  rules  for  our  processes  are  simple  width  and  spacing  checks, 
so  these  two  macros  considerably  simplify  the  writing  of  rule  sets.  Our  nMOS 
rule  set  contains  8  width  rules,  6  spacing  rules,  and  9  of  the  detailed  edge  rules 
for  situations  that  cannot  be  handled  by  the  width  and  spacing  rules  (e.g. 
transistor  overhangs).  Magic  expands  these  23  high-level  rules  into  126 
detailed  edge  rules.  The  complete  high-level  rule  set  for  nMOS  is  given  in  the 
Appendix. 
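As a rough illustration of macro expansion, the Python sketch below expands a width macro into one edge rule for every pairing of a type inside the types field with a type in the same plane outside it, as the Appendix describes. The shape of each generated rule (outside material on the first side, inside material on the second, only inside types allowed within d) is an assumption of this sketch, corner fields are omitted, and the spacing macro is not covered.

```python
# Single-letter mask types of the poly/diffusion plane (see Table 3).
POLY_DIFF_PLANE = frozenset("sdpDPBef")


def expand_width(types, d, plane_types=POLY_DIFF_PLANE):
    """Expand `width <types> <d>` into detailed edge rules (hypothetical form).

    One rule is produced per (outside type, inside type) pair. Each rule says:
    crossing from `type1` into `type2`, only the inside types may appear
    within `d` units."""
    inside = frozenset(types)
    outside = plane_types - inside
    return [
        {"type1": o, "type2": i, "d": d, "allowed": inside}
        for o in sorted(outside)
        for i in sorted(inside)
    ]


POLY_WIDTH_RULES = expand_width("pPBef", 2)
```

For `width pPBef 2` on this eight-type plane the sketch produces 5 x 3 = 15 edge rules, which gives a feel for how a handful of macros can grow into a much larger detailed rule set.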

The width and spacing macros make Magic’s checker more efficient
because the width and spacing rules are symmetric. If layers x and y are too
close together, the violation can be detected from either an edge of x or an
edge of y. This means that it is unnecessary to check the rules from both
edges. Magic takes advantage of this symmetry by checking width and
spacing rules in only two directions (left-to-right and bottom-to-top). In addition,
symmetric rules mean that corner extension is only necessary on one end of
each edge. Since most of the detailed edge rules come from the width and
spacing macros, this speeds up the checking process by almost a factor of two.

4.  Continuous  Design-Rule  Checking 

This  section  shows  how  the  basic  checker  is  used  to  provide  continuous 
incremental  rule  validation.  As  in  the  previous  section,  we  consider  only 
single-cell  designs  here. 

In order to perform DRC incrementally, Magic maintains two extra kinds
of information with each cell, stored in the same form as mask layers. First,
Magic keeps information about rule violations that have been detected but
haven’t been corrected. The violations are represented by error tiles that
cover the areas where rule constraints are not satisfied. The second kind of
information consists of tiles describing the areas of the circuit that need to be
reverified. The error tiles and the reverify tiles are stored in separate
corner-stitched planes. Each cell contains its own error and reverify planes.

When  a  designer  changes  a  cell,  Magic  creates  reverify  tiles  that  cover  the 
area  modified.  The  design-rule  checker  runs  in  background  while  Magic  is 
waiting  for  the  designer  to  enter  the  next  command.  DRC  first  searches  for 
reverify  tiles.  Then  it  invokes  the  basic  checker  over  the  area  covered  by  each 
tile  found.  The  basic  checker  reverifies  the  area  on  each  of  the  cell’s  planes, 
updates error tiles, and erases the reverify tile. Changes to the error
information are reflected immediately on the graphics screen.

If  the  designer  invokes  a  command  while  the  checker  is  running,  the 
checker  stops  so  that  the  command  can  be  processed  without  delay.  After  the 
command  finishes,  the  checker  resumes  by  starting  over  on  the  area  that  it 
was  working  on  just  before  the  interruption.  Large  reverify  tiles  are  broken 
up  into  small  ones  before  checking,  in  order  to  reduce  the  amount  of  work 
that might have to be repeated. When there are large areas to be reverified,
the checker works across the design in a style like “Pac-Man,” gobbling up
reverify tiles and spitting out error tiles.
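The interruptible background loop can be sketched as below. This is a single-threaded Python model under several assumptions: reverify tiles are kept in a queue rather than a corner-stitched plane, an arriving command is modeled by the hypothetical `Interrupted` exception, and the chunk size of 8 units is arbitrary.

```python
from collections import deque


class Interrupted(Exception):
    """Raised by the checking routine when a user command arrives."""


def split_rect(rect, size):
    """Break a large reverify area into chunks at most `size` units square."""
    x0, y0, x1, y1 = rect
    return [(x, y, min(x + size, x1), min(y + size, y1))
            for y in range(y0, y1, size)
            for x in range(x0, x1, size)]


def add_reverify(queue, rect, size=8):
    """Record a modified area as a series of small reverify tiles."""
    queue.extend(split_rect(rect, size))


def run_checker(queue, check_area):
    """Check reverify tiles until none remain or a command interrupts.

    A tile is removed only after its check completes, so an interrupted tile
    is simply rechecked from the start on the next call. Returns True when
    the queue has been emptied."""
    while queue:
        area = queue[0]
        try:
            check_area(area)
        except Interrupted:
            return False
        queue.popleft()
    return True
```

Because each chunk is small, an interruption throws away at most one chunk's worth of work, which is the motivation the text gives for breaking up large reverify tiles.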

If incremental checking is done carelessly, errors may not be detected
when new violations are introduced, and error information may be left in the
database even after the violations have been corrected. Figure 5 illustrates the
problem and Magic’s solution. When an area is modified, error information
may be affected in both the area that was modified and in the surrounding
area (for example, material in area A may be too close to something in the
surrounding area B). We call the surrounding area the halo. Its width is equal to
the largest distance in any design rule. Error information must be recomputed
in the modified area and its halo. However, errors in the halo don’t necessarily
involve the inner modified area. They may come from interactions between
the halo and a second halo outside it. To regenerate errors in the first halo
correctly, information in the second halo must be considered.

If  area  A  of  Figure  5  were  modified,  Magic  would  recheck  it  by  deleting 
all  error  information  in  A  and  B.  The  checker  would  then  generate  new  error 
information  in  both  areas  by  invoking  the  basic  checker  over  areas  A,  B  and 
C.  Any  errors  found  during  this  process  would  be  clipped  to  the  area  of  A  and 
B,  so  that  error  information  outside  the  region  where  errors  were  erased  would 
not  be  affected. 
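The erase, recheck, and clip sequence for areas A, B, and C can be written down compactly. In this Python sketch, errors are modeled as a set of unit squares instead of error tiles, and `basic_checker` is a hypothetical callback that returns the violating squares found in a given area; both are simplifying assumptions.

```python
def grow(rect, amount):
    """Expand a rectangle (x0, y0, x1, y1) by `amount` on every side."""
    x0, y0, x1, y1 = rect
    return (x0 - amount, y0 - amount, x1 + amount, y1 + amount)


def contains(rect, square):
    """True if unit square (x, y) lies inside the half-open rectangle."""
    x0, y0, x1, y1 = rect
    x, y = square
    return x0 <= x < x1 and y0 <= y < y1


def recheck(errors, modified, rule_distances, basic_checker):
    """Recompute error information after `modified` (area A) changes."""
    halo = max(rule_distances)             # largest distance in any rule
    erase_area = grow(modified, halo)      # A plus first halo B
    check_area = grow(modified, 2 * halo)  # A, B, and second halo C
    # Delete old error information in A and B only.
    kept = {e for e in errors if not contains(erase_area, e)}
    # Check A, B and C, but clip the results to A and B so that error
    # information outside the erased region is not disturbed.
    found = {e for e in basic_checker(check_area) if contains(erase_area, e)}
    return kept | found
```

The clipping step matters: a violation reported in C might already be recorded there, or might depend on material even farther out that was not examined, so only results inside A and B are trusted.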


[Figure 5 diagram: the modified area A, surrounded by the first halo B and the
second halo C.]

Figure 5. If area A is modified, the design-rule checker erases existing error
information in both A and B. Errors in B could have come from information in A, B or
C, so all three areas must be checked to regenerate all of the errors. The width of
the halos B and C is equal to the largest distance in any design rule.

The reverify and error tiles are stored with cells so that they are not lost
at the end of an editing session. Normally, there will be no reverify tiles left at
the end of a session, but if a large area has been changed recently, it is
possible that it won’t have been reverified when the session ends. In this case, the
reverify tiles are written to disk with the cell. When the cell is read in during
the next editing session, the design-rule checker will notice the reverify tiles
and continue the reverification process. The reverify and error tiles are
identical to the tiles used to represent mask layers, except that they are not
manipulated directly by the designer.

5.  Hierarchical  Checking 

Most of the layouts created with Magic consist of hierarchical cell
structures rather than single cells (Figure 6). Each cell may contain subcells, and
the subcells may overlap other subcells or mask information in the parent. A
subcell may appear any number of times in any number of parents.

In hierarchical designs, errors can arise in any of three ways:

a)  the  mask  information  of  an  individual  cell  may  be  incorrect; 

b)  a  subcell  may  interact  incorrectly  with  another  subcell;  and 

c)  a  subcell  may  interact  incorrectly  with  mask  information  in  its  parents. 

Magic’s incremental checker includes facilities to detect all of these errors.
Overlapping subcells are no more difficult to handle than subcells that merely
abut, because interaction errors are possible in either case.

Magic’s Incremental Design-Rule Checker    December 7, 1983


5.1. Simple Checks and Interaction Checks

Two overall rules guide the hierarchical checker. First, the mask
information in every cell is required to satisfy the design rules by itself, without
consideration  of  subcells.  Second,  each  cell  and  its  subcells  must  together 
satisfy  all  the  design  rules,  without  consideration  of  how  that  cell  is  used  in  its 
parents.  If  the  layout  is  viewed  as  a  tree  structure,  the  first  rule  means  that 
each  node  of  the  tree  must  be  consistent,  and  the  second  rule  means  that  each 
subtree  must  be  consistent. 


Figure 6. Circuits are defined by cells arranged in a hierarchy. If mask information
is changed in a low-level cell, Magic checks to be sure that the cell is consistent by
itself and that there are no illegal interactions in parents or grandparents.

The overall rules result in two kinds of design-rule checking. The first
rule is verified by running the basic checker over the planes containing mask
information for each cell; this is called a simple check. The second rule is
verified with an interaction check which considers interactions involving
subcells. Each cell uses separate planes to hold its mask information, so
interaction checks must combine information from different planes.

To  make  an  interaction  check  on  an  area,  the  hierarchical  structure  is 
“flattened”  to  produce  a  new  set  of  corner-stitched  planes  that  combines  all 
the  information  from  all  cells  in  the  area  to  be  checked.  This  includes  mask 
information  from  the  parent  cell,  plus  mask  information  from  subcells  and 
sub-subcells,  and  so  on.  Once  all  the  mask  information  in  the  area  has  been 
collected  into  a  single  set  of  planes,  the  basic  checker  is  invoked  on  these 
planes  in  the  standard  fashion  (halo  expansion  is  performed  as  described  in 
Section  4).  Errors  arising  from  the  interaction  check  are  placed  in  the  parent 
cell. 
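Flattening can be pictured with the following Python sketch. It assumes a much simpler model than Magic's database: mask information is a dictionary of unit squares, each subcell use carries only a translation (no rotation or mirroring), and overlapping material is collected as a set of types per square rather than merged into tiles. The `Cell` class and `flatten` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    paint: dict = field(default_factory=dict)  # (x, y) -> mask type
    uses: list = field(default_factory=list)   # (child Cell, dx, dy)


def flatten(cell, area, dx=0, dy=0, out=None):
    """Collect all mask information inside `area` from `cell`, its subcells,
    sub-subcells, and so on, into one dictionary in `cell`'s coordinates."""
    if out is None:
        out = {}
    x0, y0, x1, y1 = area
    for (x, y), kind in cell.paint.items():
        px, py = x + dx, y + dy
        if x0 <= px < x1 and y0 <= py < y1:
            out.setdefault((px, py), set()).add(kind)
    for child, cdx, cdy in cell.uses:
        flatten(child, area, dx + cdx, dy + cdy, out)
    return out
```

The result plays the role of the temporary set of planes: once it exists, the same basic checker used for simple checks can be run over it without knowing anything about the hierarchy.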

Interaction checks are more expensive than basic checks, since they
involve flattening a piece of the hierarchy. Fortunately, interaction checks can
often be avoided. For example, if an area contains no subcells, then there is
no need to perform an interaction check on that area. A simple check will
find all errors. The interaction check can also be avoided if there is only a
single subcell in an area, with no other subcells or mask information nearby. In
this case any errors must come from within the subcell, and those errors will
be found by checks made within that cell. Interaction checks are necessary
only in areas where a subcell is within one halo distance of mask information
or another subcell. Even then, we only need to check the area around the
interaction.
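One way to decide where interaction checks are needed is sketched below in Python. It works on bounding rectangles only, and the choice of returned area (the overlap of the two rectangles after each is grown by one halo) is an assumption of this sketch, since the text does not spell out exactly how large "the area around the interaction" is.

```python
def grow(rect, amount):
    """Expand a rectangle (x0, y0, x1, y1) by `amount` on every side."""
    x0, y0, x1, y1 = rect
    return (x0 - amount, y0 - amount, x1 + amount, y1 + amount)


def overlap(a, b):
    """Intersection of two half-open rectangles, or None if they are disjoint."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def interaction_areas(subcells, paint, halo):
    """Return the areas that need an interaction check.

    `subcells` and `paint` are lists of bounding rectangles for subcell uses
    and for the parent's own mask information. A pair interacts when the
    other rectangle comes within one halo distance of a subcell."""
    areas = []
    for i, sub in enumerate(subcells):
        near = grow(sub, halo)
        for other in subcells[i + 1:] + paint:
            if overlap(near, other) is not None:
                areas.append(overlap(near, grow(other, halo)))
    return areas
```

An isolated subcell produces an empty list, matching the observation that its errors are already found by the checks made inside that cell.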

5.2.  Checking  Upward  in  the  Hierarchy 

When a cell is modified, simple checks and interaction checks have to be
performed within that cell, and also within its parents in the hierarchy. For
example, suppose mask information has been edited within a cell. Then a
simple check must be performed within that cell, as well as an interaction check if
there are subcells near the modified area. However, these two checks are not
sufficient. If the modified cell is a subcell of other higher-level cells, then the
change may have introduced interaction problems within the higher-level cells.
For each parent of the modified cell, an interaction check must be performed
over the area of the modification. Interaction checks must also be performed
in grandparents, and so on up to the top-level cell in the hierarchy. In the cell
that was modified, both simple and interaction checks must be performed, but
in the parents and grandparents only interaction checks are necessary.

Magic uses two kinds of verify tiles to handle the two kinds of checks.
When a cell is modified, “verify-all” tiles are placed in that cell to signify that
both simple and interaction checks must be performed. At the same time,
“verify-interactions” tiles are placed in parents and grandparents to indicate
that interaction checks have to be performed. The background checker keeps
track of which cells in the database contain verify tiles and performs each kind
of check wherever necessary.
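The placement of the two kinds of verify tiles can be sketched as a walk up the hierarchy. In this Python model, which is illustrative only, each cell keeps a list of its uses in parents together with a translation, and verify tiles are plain lists of rectangles; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    parents: list = field(default_factory=list)             # (parent Cell, dx, dy)
    verify_all: list = field(default_factory=list)           # simple + interaction
    verify_interactions: list = field(default_factory=list)  # interaction only


def shift(rect, dx, dy):
    """Translate a rectangle into a parent's coordinate system."""
    x0, y0, x1, y1 = rect
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def mark_parents(cell, area):
    """Place verify-interactions tiles in every ancestor, one per use."""
    for parent, dx, dy in cell.parents:
        moved = shift(area, dx, dy)
        parent.verify_interactions.append(moved)
        mark_parents(parent, moved)


def mark_modified(cell, area):
    """Record that `area` of `cell` was edited."""
    cell.verify_all.append(area)
    mark_parents(cell, area)
```

A cell used twice in the same parent generates two verify-interactions tiles there, one for each placement, and each of those propagates further upward.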

In  the  worst  case,  the  hierarchical  algorithm  could  result  in  the  modified 
area  being  rechecked  once  at  each  level  of  the  hierarchy  above  the  cell  that 
was changed, with a separate flatten operation required for each check.
However, in deep hierarchies most of the interaction checks are avoidable: in cells
far  above  the  modified  one,  the  modified  area  will  almost  certainly  appear  in 
the  middle  of  a  single  subcell  with  no  mask  information  or  other  subcells 
nearby.  Unless  there  are  many  large  subcell  overlaps,  any  given  area  of  mask 
information  is  likely  to  require  an  interaction  check  at  only  one  point  in  the 
hierarchy. 


5.3.  Arrays 


One other form of hierarchical check arises because Magic has an array
construct. To simplify the creation of cell arrays, Magic contains a special
array facility: each subcell may consist of either a single instance or a one- or
two-dimensional array of identical instances. Because of the array construct,
there is actually a third overall rule that guides the hierarchical checker: each
array must satisfy all the design rules, independently of other information in
the parent containing the array. Whenever a change is made to an array, the
array structure is reverified by checking the three areas shown in Figure 7.

[Figure 7 diagram: the areas to be checked surround overlaps by one halo.]

Figure 7. An array is internally consistent if the three dotted areas satisfy the
design rules. All possible interactions between elements of the array are identical to
the ones that occur in these three regions.

6.  Implementation  and  Performance 

The design-rule checker is written in C. Its 2000 lines of code are divided
into roughly equal thirds for building the internal rule table from the
technology file, implementing the basic checker on one plane, and providing for
hierarchical checking.

The incremental checking system has just recently become operational.
We’ve made preliminary measurements on single cells with the untuned
system. The basic checker processes 200 tiles per second on a VAX 11/780
running Unix. To compare Magic’s performance with that of other systems, we
state their speeds in terms of transistors checked per second in Table 2.

A typical change to a circuit involves only a few tiles, so the cost of
incremental reverification is dominated by the size of the halos. From this, we
estimate that roughly 50 tiles have to be checked per command in an nMOS
design. This requires about one-fourth of a second of CPU time.

The  average  number  of  edges  found  per  tile  is  2.5,  but  only  1.8  of  these 
have  different  mask  types  on  the  two  sides  of  the  edge.  An  average  of  1.7 
rules  are  applied  per  non-trivial  edge. 
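The figures quoted in this section can be combined with a few lines of arithmetic. The Python snippet below reproduces the one-fourth-second estimate and, on the added assumption that the per-tile and per-edge averages may simply be multiplied, derives a rough rate of rule applications that the text itself does not state.

```python
TILES_PER_SECOND = 200           # basic checker speed on a VAX 11/780
TILES_PER_COMMAND = 50           # estimated tiles rechecked per command
NONTRIVIAL_EDGES_PER_TILE = 1.8  # edges with different types on the two sides
RULES_PER_EDGE = 1.7             # rules applied per non-trivial edge

# Time to reverify after a typical command.
seconds_per_command = TILES_PER_COMMAND / TILES_PER_SECOND

# Derived figures (assumes the averages are independent).
rule_applications_per_tile = NONTRIVIAL_EDGES_PER_TILE * RULES_PER_EDGE
rule_applications_per_second = TILES_PER_SECOND * rule_applications_per_tile

print(f"{seconds_per_command:.2f} s per command")
print(f"{rule_applications_per_tile:.2f} rule applications per tile")
print(f"{rule_applications_per_second:.0f} rule applications per second")
```

Under these assumptions the checker performs about three rule applications per tile, or roughly six hundred per second.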

7.  Conclusions 

Magic’s  design-rule  checker  demonstrates  that  incremental  checking  is 
feasible.  We  think  that  circuit  designers  will  find  that  continuous  feedback 
reduces  the  time  needed  to  create  new  designs  or  modify  existing  ones.  The 
key  to  the  incremental  checker  is  low  overhead:  the  ability  to  run  from  the 
same  database  as  the  interactive  editor,  the  ability  to  find  important  edges  in 
the layout quickly, and the ability to find nearby material quickly. The two
features of Magic’s database that reduce overhead are the corner-stitched tile
planes and the abstract mask layers. Extending the checker to work in
hierarchical designs frees the designer from tedious reverification of
interactions when subcells are revised.

System         Transistors / second
Lyra [2]       2
Baker [1]      3
Mart [5]       6-8
Magic          10-15

Table 2. Performance of several design rule checkers. All of the programs were run
on a VAX 11/780.

10. Appendix

To  illustrate  how  Magic  is  programmed  for  a  particular  technology,  this 
section  lists  the  design  rules  for  an  nMOS  process  with  buried  contacts  and  a 
single  level  of  metal.  Most  rules  are  specified  using  width  and  spacing  macros 
which Magic expands into detailed lower-level rules. Detailed edge and fourway
rules may also be specified directly. Table 3 gives the abbreviations that
we  use  for  the  names  of  mask  types. 


Poly/Diffusion plane:
    s    space
    d    diffusion
    p    polysilicon
    D    diffusion-metal contact
    P    polysilicon-metal contact
    B    buried contact
    e    enhancement transistor
    f    depletion transistor

Metal plane:
    s    space
    m    metal
    X    metal-diffusion contact
    Y    metal-polysilicon contact

Table  3.  Single  letter  abbreviations  for  the  names  of  mask  types. 

The  rules  in  Table  4a  define  minimum  line  widths  and  feature  sizes.  The 
first  three  rules  are  for  the  line  widths  of  diffusion,  metal  and  polysilicon.  The 
last  five  rules  define  the  sizes  of  contacts  and  transistors.  The  types  field  may 
include  one  or  more  mask  types.  Magic  creates  a  detailed  edge  rule  for  all 
combinations  of  one  member  of  the  types  field,  and  one  of  the  mask  types  in 
the  same  plane  that  is  not  included  in  the  types  field. 

         types    d
width    dDBef    2    diffusion
width    pPBef    2    polysilicon
width    mXY      3    metal
width    D        4    diffusion/metal contact
width    P        4    poly/metal contact
width    B        2
width    e        2    [illegible]
width    f        2    [illegible]

Table 4a. Width rules.
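
The expansion of a width macro into detailed edge rules can be sketched as follows. This is an illustration only: the tuple form (inside type, outside type, distance) and the function name are assumptions made here, not Magic's internal representation. The type lists use the abbreviations of Table 3.

```python
# Mask types in each plane, using the single-letter abbreviations of Table 3.
PLANES = {
    "poly_diff": "sdpDPBef",
    "metal": "smXY",
}


def expand_width_rule(types, distance, plane_types):
    """Expand one width macro into detailed edge rules.

    One rule is produced for every pairing of a member of `types` with a
    mask type of the same plane that is not a member of `types`.
    """
    outside = [t for t in plane_types if t not in types]
    return [(inside, out, distance) for inside in types for out in outside]


diffusion_rules = expand_width_rule("dDBef", 2, PLANES["poly_diff"])
metal_rules = expand_width_rule("mXY", 3, PLANES["metal"])
print(len(diffusion_rules), len(metal_rules))  # 15 3
```

Under this model the diffusion width macro produces fifteen edge rules (five member types against s, p and P), while the metal macro produces three.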

Table 4b contains spacing rules. We distinguish between spacing rules for
types  that  can  never  be  adjacent  and  spacing  rules  that  apply  only  when  two 
pieces  of  material  are  separated.  In  either  case,  Magic  creates  a  number  of 
detailed  edge  rules  in  a  manner  similar  to  that  for  width  rules. 

The  width  and  spacing  macros  can  be  used  to  specify  most  symmetrical 
constraints  for  a  particular  technology.  The  detailed  edge  rules  created  from 
the width and spacing macros are applied only from left-to-right across
vertical edges in the layout, and from bottom-to-top across horizontal edges.
These edge rules always check one corner, also.

[Table 4b. Spacing rules. Column headings: types 1, types 2.]

To  specify  asymmetrical  constraints  and  constraints  that  apply  alongside 
edges  but  not  in  corners,  we  use  the  explicit  edge  and  fourway  rules  listed  in 
Table  4c.  The  fourway  rules  are  applied  in  both  directions  across  all  edges  in 
the  layout.  They  also  trigger  corner  checks  on  both  ends  of  every  edge.  The 
edge rules in Table 4c are similar to the ones derived from the width and
spacing macros, but could not be written conveniently in either of those forms.

Abstract

The Magic layout editor provides a new operation called plowing, for
stretching and compacting Manhattan VLSI layouts. Plowing works directly on the
mask-level representation of a layout, allowing portions of it to be rearranged
while preserving connectivity and layout-rule correctness. The layout and
connectivity rules are read from a file, so plowing is technology independent.
Plowing is fast enough to be used interactively. This paper presents the
plowing operation and the algorithm used to implement it.

1. Introduction


Plowing  is  a  new  operation  provided  by  the  Magic  layout  editor 
[OHMST  84]  for  stretching  and  compacting  Manhattan  VLSI  layouts.  It 
allows  designers  to  make  topological  changes  to  a  layout  while  maintaining 
connectivity  and  layout  rule  correctness.  Plowing  can  be  used  to  rearrange 
the  geometry  of  a  subcell,  compact  a  sparse  layout,  or  open  up  new  space  in  a 
dense  layout.  In  a  hierarchical  environment  plowing  also  allows  cell  placement 
to be modified incrementally without the need for rerouting. To avoid
dependence on a particular technology, plowing is parameterized by a set of layout
and  connectivity  rules  contained  in  a  technology  file. 


Figure 1. Plowing opens up new space in a dense layout. Geometry is pushed in
front  of  the  plow,  subject  to  layout-rule  constraints.  The  connectivity  of  the  original 
layout  is  maintained.  Jogs  are  inserted  automatically  where  necessary. 

Conceptually the plowing operation is very simple. The user places either
a vertical or a horizontal line segment (the plow) over some part of a
mask-level representation of the layout, and then gives the direction and the
distance the plow is to move. Plowing can be done up, down, to the left, or to
the  right.  (The  rest  of  this  paper  will  assume  plowing  to  the  right.)  The  plow 
is  then  moved  through  the  layout  by  the  distance  specified.  It  catches  vertical 
edges  (boundaries  between  materials)  as  it  moves  and  carries  them  along  with 
it.  Since  only  edges  are  moved,  material  behind  the  plow  is  stretched  and 
material  in  front  of  the  plow  is  compressed.  Figure  1  shows  how  plowing  can 
be used to open up new space. Figure 2 shows how it can be used for
stretching. Plowing can be used to compact an entire cell by placing a plow to the
left  and  plowing  right,  then  placing  a  plow  at  the  top  and  plowing  down. 


Figure  2.  Material  to  the  left  of  the  plow  is  stretched.  Material  to  the  right  is 
compressed.  Objects  such  as  transistors  do  not  change  in  size. 

Plowing is so named because each of the edges caught by the plow can
cause  edges  in  front  of  it  to  move  in  order  to  maintain  connectivity  and 
layout-rule  correctness.  These  edges  can  cause  still  others  to  be  moved  out  of 
the  way,  recursively,  until  no  further  edges  need  be  moved.  A  mound  of  edges 
thus  builds  up  in  front  of  the  plow  in  much  the  same  manner  as  snow  builds 
up  on  the  blade  of  a  snowplow. 

Section  2  of  this  paper  discusses  plowing  in  the  context  of  previous  work. 
Sections 3 and 4 introduce the plowing algorithm for a single mask layer.
Section 5 extends it to multiple mask layers and hierarchical designs. Finally,
Section  6  presents  performance  measurements  and  our  experience  with  plowing 
in  the  Magic  system. 

2.  Background 

VLSI  layouts  are  difficult  to  modify.  Because  of  this,  designers  are  often 
committed  to  the  initial  choice  of  implementation,  rather  than  being  able  to 
experiment with alternatives. Existing cells often cannot be re-used in
subsequent designs because they don’t quite fit; it is typically easier to redesign a
new  cell  from  scratch  than  to  modify  an  old  one.  Bugs  in  a  dense  layout  are 
hard  to  fix,  leading  to  a  debugging  cycle  which  can  take  days. 

Many  of  these  difficulties  stem  from  the  fact  that  seemingly  small  changes 
to  a  layout  can  have  disproportionately  large  effects.  Sometimes  this  is  for 
electrical reasons. For example, in ratio logic such as nMOS, changes in the
size  of  one  transistor  may  necessitate  changes  in  the  sizes  of  others.  However, 
even  purely  topological  changes — those  which  preserve  the  electrical  properties 
of  the  layout — can  require  much  more  work  than  the  size  of  the  change  would 
suggest.  As  Figure  1  illustrated,  merely  opening  up  new  space  in  a  layout  can 
cause effects which ripple outward over a much larger area. Rearranging the
internal  geometry  of  a  cell  or  modifying  the  placement  of  cells  in  a  floor  plan 
can  be  similarly  expensive  because  of  the  need  to  maintain  connectivity  with 
the  surrounding  material. 

Previous  attempts  to  cope  with  the  re-arrangement  problem  have  used 
symbolic  design  or  sticks  [RBDD  83,  West  81,  Will  78].  In  the  symbolic/sticks 
approach,  designers  enter  layouts  in  an  abstract  form  containing  zero-width 
wires, contacts, and transistors. The sticks form is then run through a
compactor to generate actual mask information. As part of the compaction, the
circuit  elements  are  moved  as  close  together  as  the  layout  rules  permit.  In  a 
sticks  design  style,  cells  can  be  designed  loosely  without  worrying  about  exact 
spacings,  since  the  spacings  will  be  determined  by  the  compactor.  However,  it 
is  not  necessarily  easy  to  make  major  changes  to  a  sticks  cell  once  it  has  been 
entered.  Virtual  grid  systems  like  Mulga  and  VIVID  provide  mechanisms  for 
adding  new  grid  lines  uniformly  across  a  cell,  but  it  is  still  difficult  to  make 
large  topological  changes. 

The plowing approach has all the advantages of sticks. It allows cells to
be  designed  loosely  and  then  compacted.  In  addition,  plowing  can  be  used  to 
rearrange  cells  or  open  up  new  space,  either  across  the  whole  cell  or  in  one 
small  portion.  Small  changes  can  be  made  in  one  area  without  having  to 
recompact  the  entire  cell  (a  global  recompaction  may  potentially  shift  every 
geometry  in  the  cell).  The  plowing  approach  lets  the  designer  see  the  final 
sizes  and  locations  of  all  objects  as  he  is  editing;  in  the  sticks  approach,  it  is 
hard to predict the final structure of a cell from its abstract form, so
compaction must be used frequently to see the results of a change to the sticks.

3.  Simple  plowing  algorithm 

Plowing works by finding edges and moving them. An edge is a boundary,
parallel to the plow, between material of two different types. When an
edge moves, the material to its left is stretched, and the material to its right is
compressed. In this section we will describe how plowing works when only a
single mask layer is present. This material will be assumed to have a
minimum width of w, and a minimum separation of s. Edges will always be
boundaries between this material and “empty” space.

The  fundamental  step  in  plowing  is  to  move  a  single  edge.  This  step 
involves  determining  which  other  edges  must  move  as  a  consequence  of  this 
motion.  The  following  discussion  presents  plowing  as  though  it  moves  a  given 
edge by first recursively sweeping all other edges out of its way, and then
sliding the edge into the newly opened space. Section 4 will present a better
scheme  for  ordering  edge  motions  than  this  depth-first  recursion. 

3.1.  Finding  edges 

Figure 3 depicts a trivial layout consisting of three unconnected pieces of
diffusion. The edge labelled e is to be moved to a final position indicated by
the arrowhead. This could be either because e was caught by the plow, or
because it is being moved to make room for some edge to its left. At a very
minimum, the rectangular area labelled A must be swept clear of any material
before the edge can be moved. However, because of the spacing rule, any
material inside area B would then be too close to the newly moved edge.
Consequently, the area to be swept includes both areas A and B. The union of
these two areas is referred to as the umbra of the edge e*.

Figure 3. When the edge e moves, all edges in area A (the area swept out by e)
must be moved (a). Moving only these edges results in edge f moving but not edge
g. This leaves a layout-rule violation (b) between e and g. Searching area B as well
as area A avoids this problem. The two areas are referred to collectively as the
umbra of edge e.
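
As an illustration, the umbra can be computed as a single rectangle. This is a minimal sketch for plowing to the right; the `Rect` type, the function name, and the parameter `rule_distance` (the spacing that must remain clear beyond the edge's final position) are assumptions introduced here.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    xlo: int
    ylo: int
    xhi: int
    yhi: int


def umbra(edge_x, ybot, ytop, distance, rule_distance):
    """Return the area to be swept when a vertical edge moves right.

    Area A is the strip swept out by the edge itself; area B is the
    further strip of width `rule_distance` beyond its final position.
    The umbra is the union of the two.
    """
    final_x = edge_x + distance
    area_a = Rect(edge_x, ybot, final_x, ytop)
    area_b = Rect(final_x, ybot, final_x + rule_distance, ytop)
    return Rect(area_a.xlo, ybot, area_b.xhi, ytop)


print(umbra(edge_x=10, ybot=0, ytop=5, distance=4, rule_distance=3))
```

An edge at x = 10 that moves 4 units with a 3-unit rule therefore has an umbra extending from x = 10 to x = 17.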

Plowing must also search above and below the umbra to prevent the edge
from sliding too close to other edges above or below it. Figure 4a shows why
this is necessary. If material were moved out of the umbra alone, as in Figure
4b, the result would be electrical disconnection. To avoid this, plowing must
also move edges out of the areas above and below the umbra. The correct result is
shown in Figure 4c. The areas above and below the umbra are referred to
collectively as the penumbra. Jog insertion is an automatic consequence of
searching the penumbra. Moving edges out of the penumbra also prevents
electrical shorts, as can be seen by reversing the roles of material and space in
Figures 4a-4c.

Figure 4. When the edge e moves (a), edges in its umbra must be moved to the
right. If only edges in the umbra are moved, however, the result can be electrical
disconnection (b). To avoid this, plowing also moves edges in the penumbra to the
right, giving the correct result shown in (c). This has the effect of inserting jogs
automatically. The height of the penumbra is w, the minimum width for diffusion. If
diffusion had been to the left of e instead of to the right, the height of the penumbra
would have been s, the minimum separation.

*  In a solar eclipse, the umbra is that portion of the moon's shadow from which the sun
appears to be completely eclipsed. The penumbra is the part of the shadow surrounding the
umbra from which the sun appears only partially eclipsed. In plowing, the umbra contains
edges directly in the path of an edge being moved, while the penumbra contains edges not in
the path but nonetheless too close.

The left-hand boundary of the penumbra is not always aligned with the
edge being moved. Instead, this boundary is formed by following the outline
of the material forming the edge, as illustrated in Figure 5. This ensures that
the penumbra contains only those edges which must move in order to preserve
layout rule correctness and connectivity. The umbra and penumbra of an
edge are collectively referred to as its shadow. The shadow of e contains all
the edges which must move as a direct consequence of moving e.

Figure 5. If e’s penumbra included all of area A, as shown in (a), then edge f would
be found and moved, resulting in (b). This is undesirable, since f need not move in
order to preserve layout-rule correctness and connectivity. A better definition of the
penumbra would be area B only, as shown in (c). Searching this area would result in
only the edge g being found and moved, as is necessary to preserve layout rule
correctness.


3.2.  Sliver  prevention 

The  rules  described  in  Section  3.1  guarantee  that  plowing  never  moves 
one  vertical  edge  too  close  to  another.  However,  they  do  allow  violations  to  be 
introduced  between  horizontal  segments  that  are  formed  when  material  is 
stretched.  These  violations  take  the  form  of  slivers  of  material  or  space  whose 
height  is  less  than  the  minimum  allowed.  Eliminating  such  slivers  requires 
that  their  left-hand  edges  be  moved,  as  illustrated  in  Figure  6.  The  left-hand 
edge  of  each  sliver  lies  along  the  left-hand  boundary  of  the  penumbra,  so  it 
can  be  found  when  tracing  the  outline  of  the  penumbra. 


Figure 6. When the edge e moves (a), a sliver of space is introduced below the
horizontal segment h, as shown in (b). To correct this, the left-hand edge of this
sliver, f, is moved along with e, but only as far as the right-hand end of the
segment h (c).

Figure 7. This lattice structure causes exponential worst-case behavior in the
depth-first  plowing  algorithm  when  edges  in  the  shadow  are  processed  from  top  to 
bottom.  The  objects  (A,  B,  etc.)  must  be  incompressible  to  cause  this  worst-case 
behavior.  Object  B  is  moved  once  when  object  A  moves,  then  slightly  farther  when 
object  C  moves.  The  numbers  to  the  left  of  each  object  show  how  many  times  each 
of  its  edges  is  moved. 

4.  Breadth-first  vs.  Depth-first  Search 


In  the  previous  section,  plowing  was  described  as  a  depth-first  search  in 
which  all  edges  to  the  right  of  a  given  edge  were  moved  before  the  edge  itself. 
While  this  approach  is  conceptually  clear,  it  has  poor  worst-case  behavior.  An 
N-tier lattice structure as illustrated in Figure 7 requires on the order of 2^N
edge motions, because plowing performs the recursive search to the right of an
edge each time the edge is moved. If, as in the example, each edge must be
moved once for each of its two neighbors to the left, the edges at the
right-hand side of the lattice are moved a number of times that is exponential in
the number of tiers.

Instead, plowing waits until the final position of an edge is known before
it performs the search to the right of that edge. This strategy causes the
number of edge motions to be linear in the number of edges in the lattice. (A
detailed explanation is given in [Oust 84].)
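
The difference between the two strategies can be seen by counting motions in a toy model. The sketch below assumes that the lattice of Figure 7 behaves like a triangular grid in which object (t, i) of tier t pushes objects (t+1, i) and (t+1, i+1); this grid model and the function names are illustrative assumptions, not taken from the Magic implementation.

```python
def depth_first_moves(tiers):
    """Count motions when the search to the right is repeated on every move."""
    count = 0

    def move(t, i):
        nonlocal count
        count += 1
        if t + 1 < tiers:
            move(t + 1, i)        # push the upper right-hand neighbor
            move(t + 1, i + 1)    # push the lower right-hand neighbor

    move(0, 0)
    return count


def breadth_first_moves(tiers):
    """Count motions when each object is moved once, leftmost tier first."""
    pending = {(0, 0)}
    count = 0
    while pending:
        t, i = min(pending)       # the leftmost pending object
        pending.remove((t, i))
        count += 1
        if t + 1 < tiers:
            pending.update({(t + 1, i), (t + 1, i + 1)})
    return count


print(depth_first_moves(10), breadth_first_moves(10))  # 1023 55
```

In this model the depth-first count is 2^N - 1, while the breadth-first count equals the number of objects, N(N+1)/2.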

A simple way to ensure that edges are moved only once their final
positions are known is to use breadth-first search. Magic maintains a list of edges
to  be  moved,  sorted  in  order  of  increasing  x-coordinate.  On  each  iteration,  the 
leftmost  edge  is  removed  from  the  list  and  the  shadow  to  its  right  is  searched. 
Any  edges  discovered  by  this  search  are  placed  in  the  list  along  with  the 
amount  they  must  move.  Since  the  final  position  of  an  edge  can  only  be 
affected  by  edges  to  its  left,  the  final  position  of  the  leftmost  edge  in  the  list  is 
always  known. 
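
The loop described above can be sketched in one dimension. This is a simplified model, not Magic's code: it assumes a single horizontal line of vertical edges, one uniform rule distance `sep` between any pair of edges, and a binary heap standing in for the sorted list. All names are hypothetical.

```python
import heapq


def plow_1d(edges, start_x, distance, sep):
    """Return {original x: distance moved} for edges on one horizontal line.

    `edges` holds the original x-coordinates. The edge at `start_x` is
    moved right by `distance`; other edges move only as far as needed to
    stay at least `sep` away from the edge that pushes them.
    """
    move = {start_x: distance}
    queue = [start_x]                      # pending edges, ordered by x
    while queue:
        x = heapq.heappop(queue)           # leftmost edge: its motion is final
        final = x + move[x]
        for other in edges:                # search the shadow to the right
            if x < other < final + sep:
                needed = final + sep - other
                if other not in move:
                    move[other] = needed
                    heapq.heappush(queue, other)
                else:
                    move[other] = max(move[other], needed)
    return move


print(plow_1d([0, 5, 7, 20], start_x=0, distance=6, sep=2))
# {0: 6, 5: 3, 7: 3}
```

The edge at 7 is first asked to move by 1 and later by 3. Because it is not removed from the queue until every edge to its left has been processed, only the final value of 3 is ever used.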

The  depth-first  algorithm  allowed  the  layout  to  be  modified  incrementally 
as  plowing  progressed,  since  an  edge  was  never  moved  until  the  area  into 
which  it  was  moving  had  been  cleared.  Incremental  modification  is  impossible 
with  breadth-first  search,  since  edges  to  the  right  will  not  be  moved  as  long  as 
there  are  queued  edges  to  the  left  of  them  waiting  to  be  moved.  Instead  of 
actually updating the layout as it progresses, the breadth-first version of
plowing stores with each vertical edge segment the distance it moves. When the
shadows of all edges have been searched, and the distance each edge moves
has been determined, plowing invokes a post-pass to update the layout from
[...] with it the distance it is going to move. As a consequence, plowing can use
the  initial  layout  structure  for  searching,  and  yet  can  easily  find  all  objects 
whose  final  coordinates  fall  in  a  given  area. 

5.  Extensions  for  real  layouts 

This section extends the simple plowing algorithm of the previous two
sections to handle multiple mask layers. Plowing is also extended to handle
features, such as transistors and contacts, whose size should not be changed,
and to allow noninteracting mask layers, such as metal and polysilicon, to slide
past each other. Finally, since layouts in Magic may be hierarchical, this
section closes with a description of how plowing handles hierarchy.

5.1. Multiple mask layers

The  simple  version  of  plowing  assumed  that  the  shadow  extended  to  the 
right  of  the  final  position  of  a  moving  edge  by  either  w  (the  minimum-width 
rule) if material lay to the right of the edge, or s (the minimum-separation
rule) if material lay to the left of the edge. This ensured that the shadow
included  all  edges  directly  in  the  path  of  the  edge  being  moved.  Since  the 
same  layout  rule  applied  between  the  edge  being  moved  and  any  other  edge, 
all  edges  found  during  the  search  of  the  shadow  would  have  to  move. 

With  more  than  one  mask  layer  there  may  be  more  than  one  layout  rule 
to  apply  for  a  given  edge.  For  example,  in  our  nMOS  process,  the  minimum 
separation between diffusion and polysilicon is 2 microns, while that between
two  pieces  of  diffusion  is  6  microns.  Both  of  these  rules  apply  at  an  edge 
between  diffusion  and  empty  space. 

To ensure that the shadow contains all edges which must move, the shadow
must extend beyond the area the edge sweeps out by the worst-case layout
rule distance applying to that edge. As Figure 9 illustrates, however, not
all  of  the  edges  found  in  the  shadow  search  will  actually  need  to  move.  Each 
edge  found  must  be  checked  for  its  minimum  allowable  separation  from  the 
edge  being  moved.  Fortunately,  this  can  be  done  very  quickly  using  the  same 
techniques  as  those  used  in  Magic’s  incremental  layout-rule  checker  [TaOu  84]. 
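
The two-step test (a search sized by the worst-case rule, then a per-edge check) can be sketched as follows. The spacing values are the ones quoted in the text and in the caption of Figure 9; the table layout, the function names, and the tuple format for candidate edges are assumptions made for this illustration.

```python
# Minimum separations in microns, keyed by the unordered pair of materials.
SPACING = {
    frozenset(["diffusion"]): 6,                  # diffusion to diffusion
    frozenset(["polysilicon"]): 4,                # polysilicon to polysilicon
    frozenset(["diffusion", "polysilicon"]): 2,
}


def worst_case(material):
    """Largest rule distance involving `material`; sets the search area."""
    return max(d for pair, d in SPACING.items() if material in pair)


def edges_to_move(material, final_x, candidates):
    """Return {edge name: distance} for edges that really must move.

    `candidates` is a list of (name, x, material) for edges to the right
    of the moving edge, whose final position is `final_x`.
    """
    halo = worst_case(material)
    result = {}
    for name, x, other in candidates:
        if x >= final_x + halo:
            continue                              # outside the shadow
        required = SPACING[frozenset([material, other])]
        shortfall = final_x + required - x
        if shortfall > 0:
            result[name] = shortfall
    return result


found = edges_to_move("polysilicon", 10,
                      [("f", 12, "polysilicon"), ("g", 13, "diffusion")])
print(found)  # {'f': 2}
```

Both edges fall inside the 4-micron search area, but only f violates its own rule, which mirrors the situation shown in Figure 9.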


Figure 9. The area of a shadow search is determined by the worst-case layout rule.
However, not all edges in that area will have to be moved. Edge f must move,
because the separation between two polysilicon features must be 4 microns and edge e
approaches to within 2 microns of f. Edge g need not move since the minimum
separation between polysilicon and diffusion is only 2 microns.


Figure  10.  An  edge  between  two  different  types  of  material  has  a  penumbra  for 
each.  The  spacing  rules  for  material  of  type  A  are  applied  in  A’s  penumbra.  The 
minimum-width rule for material of type B is applied in B’s penumbra. The sizes of
each  penumbra  may  be  different  because  of  the  different  layout  rules  applied  in  each. 

If  the  edge  being  moved  has  material  on  both  sides,  there  is  really  a 
penumbra  for  each  type  of  material.  The  layout  rules  applied  while  searching 
each  penumbra  will  in  general  be  different.  Slivers  must  be  prevented  along 
the  boundaries  of  both  penumbra.  See  Figure  10  for  an  example. 


Figure 11. If edge e is plowed, material A may disconnect from B and C. To
prevent this, a minimum-width segment of edges f and g is dragged along with e.
The edge g is moved not to maintain connectivity (which would have been achieved
by moving h), but to prevent C from being uncovered. In (c), m1 is the lesser of the
minimum widths for A and B, m2 is the minimum width for B, and m3 is the
minimum width for C.

Multiple mask layers require extra caution to maintain connectivity with
material above and below an edge being moved. In the single-layer scheme,
the penumbra search guarantees that the material does not become
disconnected. However, the penumbra search follows the outline of a single type of
material,  so  it  will  not  by  itself  guarantee  that  two  adjacent  materials  of 
different  types  will  remain  connected  (see  Figure  11). 

Special actions must be taken during the penumbra search to handle
horizontal edges between different materials. First, if two materials share a
horizontal edge, then Magic guarantees that one material does not slide past the
end  of  the  other:  it  maintains  a  minimum-width  connection  between  the  two 
(this  is  the  case  between  materials  A  and  B  in  Figure  11).  Second,  if  one 
material  completely  covers  the  edge  with  another  material  (for  example,  the 
A-C  edge  in  Figure  11),  Magic  plows  the  other  material  as  much  as  is  needed 
to  maintain  complete  coverage.  This  ensures,  for  example,  that  transistors 
don’t  get  uncovered  by  plowing  polysilicon  off  one  side. 

5.2.  Inelastic  features 

Certain features in a layout should not be stretched or compacted.
Transistors, for example, have sizes chosen for electrical reasons, as do
contacts. Our discussion of edge motion has assumed that the material forming
both sides of the edge was stretchable. When material is inelastic, both its
left-hand and right-hand edges must be moved in tandem. In particular, if the
right-hand edge of a piece of inelastic material moves, its left-hand edge must
move also.

Figure 12. When inelastic objects are present, plowing may have to cope with
circular dependencies. Material B is inelastic, and A and C are both minimum-width.
When edge e moves by distance d in (a), object B must move by the same distance
to prevent A from being uncovered. To prevent C from being uncovered, C’s
left-hand edge must move, finally causing edge f to move by distance d. Edge e is in f’s
shadow as a result, but should not be moved a second time.


A  consequence  of  inelasticity  is  that  moving  an  edge  can  cause  motion  of 
edges  to  its  left,  possibly  resulting  in  a  circular  dependency.  The  example  in 
Figure  12  illustrates  such  a  dependency.  The  depth-first  plowing  algorithm  is 
completely incapable of resolving such a dependency. The breadth-first
algorithm resolves it by comparing the amount an edge is supposed to move with
the motion distance already stored with the edge. If the stored motion
distance is greater, the edge need not be moved a second time.

If the distance r between edges f and e in Figure 12 is less than s, the
minimum separation allowed (i.e., there is currently a layout rule violation),
looking at the motion distance of e is insufficient. When the shadow of f is


Plowing 


December  2,  1983 


searched, plowing is supposed to move all edges found far enough away so that
they cause no rule violations with the newly moved f. This would mean that
edge e would have to move by d+s-r, which is more than the motion distance
stored with the edge. As a result, the plowing algorithm would loop infinitely, each
time moving edge e by an additional s-r. To avoid infinite walks, plowing
never moves a shadowed edge (e.g., e) more than the edge causing the shadow
(e.g., f). This technique prevents infinite looping, but preserves layout rule
violations existing in the original layout.
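
The two safeguards described in this section, the comparison with the stored motion distance and the limit imposed by the edge causing the shadow, can be combined in one small routine. This is a sketch under assumed names; `stored` stands for the per-edge motion distances kept by the breadth-first algorithm.

```python
def request_move(stored, edge, wanted, cause_distance):
    """Record a motion request; return True if `edge` must be queued.

    stored          -- dict mapping each edge to its stored motion distance
    wanted          -- distance the shadow search asks `edge` to move
    cause_distance  -- distance moved by the edge whose shadow found `edge`
    """
    # A shadowed edge never moves more than the edge causing the shadow.
    wanted = min(wanted, cause_distance)
    # If the stored distance already covers the request, nothing changes.
    if stored.get(edge, 0) >= wanted:
        return False
    stored[edge] = wanted
    return True


d, s, r = 5, 3, 1                # motion distance, rule, current separation
stored = {"e": d}                # e has already been moved by d
print(request_move(stored, "e", d + s - r, cause_distance=d))  # False
```

Without the `min`, the request of d+s-r would exceed the stored distance d and edge e would be queued again on every pass around the cycle.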

5.3. Noninteracting planes

Section  4  explained  that  the  order  of  vertical  edges  along  a  horizontal  line 
is  unchanged  by  plowing.  Thus  material  being  plowed  can  never  slide  over 
other  material  in  its  path.  There  are  cases,  however,  where  it  is  desirable  that 
certain  materials  in  a  layout  move  independently.  Metal,  for  example,  does 
not  interact  with  either  polysilicon  or  diffusion  except  at  contacts,  so  it  should 
be  able  to  slide  over  them. 

To  allow  sliding,  Magic  segregates  the  mask  information  in  a  layout  into  a 
collection  of  non-interacting  planes.  Material  in  one  plane  is  free  to  slide  past 
material  in  any  other  plane.  The  nMOS  technology,  for  example,  has  two 
planes: one to hold metal wires, and one to hold polysilicon, diffusion, and
[...]

Figure 13. A contact is duplicated on each plane it connects. When an edge of a
contact is moved on one plane, it is moved on all other planes as well.

The  plowing  algorithm  operates  on  each  plane  independently.  The  only 
interaction  between  planes  occurs  at  contacts,  which  are  duplicated  in  each 
plane  that  they  connect.  When  an  edge  of  a  contact  is  moved  in  one  plane, 
the  corresponding  edge  of  the  contact  in  all  other  planes  is  moved  by  the  same 
amount,  as  illustrated  in  Figure  13.  This  also  moves  whatever  the  contact 
connects  to  in  the  other  planes,  thus  preserving  connectivity. 

5.4.  Subcells  and  hierarchy 

One  approach  for  plowing  a  hierarchical  layout,  such  as  that  shown  in 
Figure  14a,  is  to  treat  it  as  though  it  were  non-hierarchical  and  propagate 
edge  motions  inside  subcells.  This  might  be  workable  when  no  subcell  is  used 
more  than  once.  However,  Magic  instantiates  subcells  by  reference,  so  a 
change  in  one  instance  of  a  subcell  is  reflected  in  all  its  other  uses.  Situations 
in which a subcell is used more than once can produce unsatisfiable sets of
constraints,  as  Figure  14b  illustrates. 

Figure 14. Plowing in the presence of hierarchy. (a) Plowing might treat hierarchy
as though it were invisible to the user. Each of cells A and B would be modified. (b)
Cell C is used twice, once flipped left-to-right and once in its normal orientation.
Both uses refer to the same master definition of C. Moving edge e to the right is
impossible, because it requires e to move to the left in order to keep out of its own
path. The more edge e is moved to the right in the left-hand use, the worse the
violation becomes.


Magic takes a simpler approach, which is to view subcells as black boxes
to which connectivity must be maintained by plowing, but whose internal
structure should not be modified. A consequence of Magic’s approach is that
plowing can be used to modify the placement of cells in the floor plan of a
chip, since it only changes the location of subcells, not their contents.

When  any  mask  geometry  that  abuts  or  overlaps  a  cell  is  moved,  the 
entire  cell  must  move  by  the  same  amount.  Conversely,  whenever  a  subcell 
moves,  all  mask  geometry  and  other  subcells  that  abut  or  overlap  it  must  also 
move  by  the  same  amount.  The  net  effect  is  that  a  cell  behaves  like  flypaper, 
causing  all  geometry  over  its  area  to  “stick”  to  it  and  move  as  a  whole  when 
any  part  of  it  is  required  to  move. 
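
A minimal sketch of the flypaper rule, assuming axis-aligned rectangles given as (x1, y1, x2, y2) and treating shared edges or corners as abutting. The names `touches` and `moved_with` are hypothetical, and the sketch only computes which objects are dragged along; the ordinary design-rule propagation between pieces of mask geometry is not modeled.

```python
from collections import deque

def touches(a, b):
    """True if rectangles (x1, y1, x2, y2) overlap or abut (share an edge or corner)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def moved_with(start, cells, geometry):
    """Names of everything that must move by the same amount as `start`."""
    rects = {**cells, **geometry}
    moved, queue = {start}, deque([start])
    while queue:
        name = queue.popleft()
        for other, rect in rects.items():
            if other in moved or not touches(rects[name], rect):
                continue
            # The flypaper rule only couples pairs in which at least one is a cell.
            if name in cells or other in cells:
                moved.add(other)
                queue.append(other)
    return moved

cells = {"A": (0, 0, 10, 10), "B": (12, 0, 20, 10)}
geometry = {"wire1": (10, 4, 12, 6), "wire2": (30, 0, 32, 2)}

# wire1 abuts both cells, so moving it drags A and B along; wire2 is untouched.
result = moved_with("wire1", cells, geometry)
```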

In addition to preserving connectivity with subcells, when plowing moves
other geometry it must avoid introducing any layout rule violations with the
geometry inside a subcell. One approach for dealing with this is to define a
protection frame [Kell 82] for each cell, an outline around the cell into which
no material may be plowed. Magic uses an extremely simple form of protection
frame: it assumes that the cell contains all types of material right up to
the border of its bounding box.

For example, in our nMOS rule set, the worst-case layout rule involving
diffusion is the diffusion-diffusion spacing rule of 6 microns. An edge with
diffusion to its left can be plowed to within 6 microns of a subcell before that
subcell will itself have to move. The worst-case rule distance involving polysilicon
is 8 microns, so polysilicon can only be plowed to within 8 microns of a
subcell before the cell must move. Since the contents of subcells are considered
unknown, the closest one subcell can be plowed to another before the
other will have to move is the worst-case layout rule in the entire ruleset,
which in our ruleset is 8 microns. Of course, if the user wishes to overlap two
cells, this can still be done using other editing operations besides plowing.
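
The worst-case distances above can be derived mechanically from a rule table. The sketch below assumes a hypothetical table keyed by material pairs; only the 6 and 8 micron worst cases and the 4 micron polysilicon separation are taken from this paper, and the pairings and the remaining value are placeholders chosen to reproduce them.

```python
# Hypothetical rule table: (material_a, material_b) -> minimum spacing in microns.
# The pairings and the 2 micron entry are placeholders, not the real nMOS rule set.
RULES = {
    ("diffusion", "diffusion"): 6,
    ("diffusion", "polysilicon"): 2,
    ("polysilicon", "polysilicon"): 4,
    ("polysilicon", "buried_contact"): 8,
}

def worst_case(material=None):
    """Largest rule distance involving `material`, or over the whole ruleset if None."""
    return max(d for pair, d in RULES.items() if material is None or material in pair)

diffusion_halo = worst_case("diffusion")      # how close diffusion may come to a subcell
polysilicon_halo = worst_case("polysilicon")  # how close polysilicon may come to a subcell
cell_halo = worst_case()                      # how close one subcell may come to another
```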

6.  Results  and  experience 

Plowing  has  been  implemented  as  part  of  the  Magic  VLSI  layout  system. 
It  is  written  in  C  under  the  Berkeley  4.2  Unix  operating  system  for  VAXes.  A 
simplified  version  of  plowing  (corresponding  to  that  described  in  Sections  3 
and  4)  has  been  operational  since  October  of  1983. 

While the full implementation of plowing has not been completed, measurements
on the simple version indicate that it is fast enough to be used
interactively. An example similar to that presented in Figure 1a, consisting of
48 parallel bars of polysilicon each separated by 4 microns (the minimum
separation), took 3.2 seconds of VAX-11/780 CPU time to produce a result
similar to that in Figure 1b. Only 1.0 second was spent computing the edge
motions; the remainder of the time was spent in the post-pass which actually
updates the layout.


ABSTRACT

This paper presents a new switchbox router developed as part of the Magic
layout system. Based on Rivest and Fiduccia's “greedy” channel router, the
Magic router is capable of routing channels containing obstacles such as preexisting
wiring. It jogs nets around large obstacles and multi-layer obstacles such as
contacts. Where unable to avoid large single-layer obstacles, it river-routes
through them. It combines the effectiveness of traditional channel routers with
the flexibility of net-at-a-time routers.

Keywords  and  Phrases:  channel  routing,  physical  design  aids,  layout,  VLSI. 


1.  Introduction 

Previously  placed  wires  such  as  power  and  ground  routing  form  obstacles  in 
routing  areas.  We  have  developed  a  new  switchbox  router  as  part  of  the  Magic 
layout  system  [OHM],  capable  of  routing  channels  containing  such  obstacles.  The 
router’s  novel  aspect  is  its  ability  to  both  avoid  obstacles  and  consider  interactions 
between  nets  as  channels  are  routed.  It  thus  combines  good  features  from  net-at- 
a-time  routers  and  traditional  channel  routers. 


The  Magic  router  is  an  extension  of  Rivest  and  Fiduccia’s  “greedy”  channel 
router  [RIF].  It  performs  a  column  by  column  scan  of  a  rectangular  routing 
region.  At  each  column  it  applies  a  series  of  rules  controlling  the  placement  of 
vertical  jogs  within  the  column. 

Figure 1 shows the solution to a simple routing problem. Figure 2 shows the
same routing problem with an obstacle in the routing area. It illustrates some
basic principles of obstacle avoidance. As the router extends nets from left to
right, it tries to avoid large obstacles in the columns ahead by jogging around
them. If nets cannot jog around large single-layer obstacles, they river-route
through the obstacles, switching layers if necessary.

Figure 1. A simple channel routing problem. Numbers around the border of the
channel represent pins associated with signal nets. Pins with identical net numbers are
connected by the router using a left to right, column by column scan.


Figure  2.  The  problem  from  Figure  1  with  obstacles  in  the  channel  (drawn  with 
heavy  outlines).  The  router  tries  to  cross  obstacles  at  narrow  points.  If  necessary  it 
river-routes  through  obstacles. 


Section 2 motivates the problem of obstacle avoidance and describes the
Magic router’s goals. Section 3 summarizes the “greedy” router, upon which our
work is based. Section 4 presents our solutions to a number of problems encountered
in adapting the greedy channel router to avoid obstacles. In Section 5 we
present extensions for routing switchboxes. Section 6 provides a detailed view of
the router. Section 7 describes a channel splitting mechanism. The paper
concludes with a discussion of the router’s implementation and performance.


A Switchbox Router with Obstacle Avoidance

December 7, 1983


2.  Motivation 

Automated  routing  systems  typically  divide  the  routing  of  chips  into  three 
steps:  channel  definition,  global  routing,  and  channel  routing.  In  the  channel 
definition step, empty areas between cells are divided into non-overlapping
rectangular channels. The global routing step selects the sequence of channels
through  which  each  signal  net  will  be  routed  to  make  the  desired  connections. 
The  channel  routing  step  assigns  physical  locations  to  the  wires  in  each  channel, 
realizing  the  signal  routings  specified  in  the  global  routing  step. 

A  standard  model  for  channel  routing  assumes  a  grid  of  two  independent 
layers  of  minimum  width  wiring.  Horizontal  tracks  are  wired  in  one  of  these 
layers,  while  vertical  columns  are  wired  in  the  other  layer.  Connections  between 
layers  are  made  with  contacts;  where  no  contacts  appear,  layers  may  cross  over 
each  other. 

Routers  generally  assume  that  channels  start  off  completely  free  of  wiring. 
Thus,  it  is  impossible  to  use  an  automatic  router  with  pre-routed  wires.  This  is  a 
serious  limitation  since  certain  signals  such  as  power,  ground,  and  clock  lines  have 
special restrictions on width, layer, and length which existing channel routers fail
to  handle. 

Because  channel  routers  do  not  tolerate  the  presence  of  obstacles,  designers 
must  either  accept  inferior  results  generated  by  automatic  routing  systems  or  try 
to  hand  patch  the  router  output.  Hand  patching  is  difficult  because  automatic 
routers  leave  little  room  to  add  wires  or  to  move  wires  to  different  layers.  Also,  if 
the  chip  has  to  be  rerouted,  the  hand  patching  must  be  completely  redone. 

Channel  routing  systems  that  do  handle  obstacles  have  done  so  in  restricted 
ways.  The  PI  system  [Riv]  has  the  notion  of  “covered  channels”  —  areas  wired 
with  metal  for  power  and  ground  routing,  through  which  other  signals  may  be 
river  routed  in  polysilicon.  It  fragments  large  channels  containing  metal  power 
and  ground  wiring  into  many  smaller  covered  and  uncovered  channels  each  of 
which  must  be  routed  individually.  The  problem  of  routing  one  large  channel  is 
thereby reduced to the problem of routing several smaller, more constrained
channels.

The  BBL  system  [Che],  [CHK]  handles  prewiring  from  a  separate  power  and 
ground routing phase. It routes power and ground signals near the edges of
channels. BBL then ignores the power and ground routing areas except to bridge
other  signals  across  them.  It  routes  all  other  signals  using  only  the  clear  parts  of 
the  channels.  BBL  does  not  allow  any  hand  routing. 

Routers  using  maze  [Lee][Hig]  routing  methods  are  able  to  avoid  obstacles  on 
multiple  layers.  The  problem  with  these  routers  is  that  they  consider  only  one  net 
at  a  time.  Since  they  completely  route  a  single  net  before  considering  the  next 
net,  they  cannot  consider  interactions  between  nets  as  a  channel  is  routed.  For 
this  reason  these  routers  are  inferior  to  true  channel  routers  for  channel  routing  of 
general  layouts  [Sou]. 

The Magic router provides a general obstacle avoidance capability that combines
the advantages of the above approaches. It allows designers to prewire critical
nets, putting them at any position and in any layer. The Magic router routes
around these prewired nets.

The  router  considers  interactions  between  nets.  Routing  decisions  are  based 
on  an  overall  strategy  rather  than  on  a  net-at-a-time  basis.  Considering  tradeoffs 
between  alternatives  improves  the  overall  quality  of  the  resulting  wiring. 

Magic  uses  single-layer  obstructed  areas  to  do  useful  routing.  Since  large, 
loosely  constrained  areas  are  easier  to  route,  it  avoids  fragmenting  these 
obstructed  areas  into  small,  highly  constrained,  hard  to  route  areas. 

Particularly in interactive design environments, nearly optimal results
obtained quickly are more useful than optimal results obtained after long
computation. Our router is fast and produces good results.

3. The Greedy Router

Since the greedy router algorithm is the starting point from which our algorithm
was developed, we start with a brief overview of its operation. Three
features of the greedy router are of particular importance. First, the greedy
router makes a column by column scan of the routing area. It completely wires
the current column before extending active tracks into the next column. Second,
it uses a list of rules to control the placement of vertical wiring in a column.
Rules are applied in order of importance, (a) to avoid “getting stuck” and (b) to
make subsequent columns easier to route (Figure 3). The list of rules can easily
be modified. Third, unlike constraint graph approaches, the greedy router allows
split nets, nets that occupy more than one track at a time. Split nets give the
router the flexibility to evaluate alternatives and choose the one that is best for
the overall routing problem.

Column  wiring  begins  by  bringing  the  nets  of  a  column’s  top  and  bottom  pins 
(if  any)  into  the  first  tracks  that  are  either  vacant  or  already  assigned  to  the  nets. 
Deferring this to a later step might allow vertical wiring to block a net, preventing
it from being brought into a vacant track.
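
The first-track rule can be sketched as below, assuming a column is modeled as a list of tracks indexed from the top, each holding a net number or `None` when vacant. The function `first_track` is a hypothetical name, and the sketch ignores conflicts between the vertical wires of the top and bottom pins.

```python
def first_track(tracks, net, from_top=True):
    """Index of the first track, scanning from the pin's edge of the channel,
    that is vacant or already assigned to `net`; None if there is no such track."""
    order = range(len(tracks)) if from_top else range(len(tracks) - 1, -1, -1)
    for i in order:
        if tracks[i] is None or tracks[i] == net:
            return i
    return None

tracks = [3, None, 1, None, 2]   # index 0 is the top track

# A top pin of net 1 lands in vacant track 1 even though net 1 already holds
# track 2, which leaves net 1 split across two tracks.
top_choice = first_track(tracks, net=1, from_top=True)
bottom_choice = first_track(tracks, net=1, from_top=False)
```

The example shows why this step can create split nets, which the later rules then try to collapse.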


Figure  3.  Three  columns  wired  by  the  greedy  router.  In  the  first  column  net  2  makes 
a  collapsing  jog  and  net  3  makes  a  falling  jog.  In  the  second  column  net  4  enters  the 
channel, preventing net 1 from making a collapsing jog; however, net 1's lower track
makes a jog to reduce the range of tracks assigned to this split net.

Bringing  a  net  into  the  first  available  track  may  leave  it  split  on  multiple 
tracks. Split nets can fill up the channel, making it impossible to bring in additional
nets. The greedy router thus makes collapsing split nets its next priority.
Since  conflicting  vertical  wiring  can  make  it  impossible  to  collapse  all  split  nets  in 
a  particular  column,  the  router  collapses  split  nets  in  the  pattern  that  frees  up  the 
most  empty  tracks  for  use  in  the  next  column. 

Vertical wiring conflicts may prevent the router from collapsing all split nets.
The router simplifies the routing of these remaining split nets by reducing the
range of tracks occupied by these nets. It jogs each split net’s highest occupied
track downward and its lowest occupied track upward. The remaining problem
is easier because collapsing can be done with shorter jogs.

Next,  unsplit  rising  and  falling  nets  are  jogged  upward  or  downward  toward 
the  edge  of  the  channel  with  their  next  pin.  This  step  anticipates  the  split  nets 
that  will  be  created  when  upcoming  pins’  nets  are  brought  into  the  channel.  It 
attempts  to  reduce  the  range  of  these  split  nets  before  they  are  created.  This  step 
prevents split nets if the rising or falling net can be jogged into what would otherwise
be the first vacant track seen by a net as it enters the channel from a top or
bottom  pin. 

The handling of split nets and the handling of rising and falling nets are examples of
decisions based on interactions between nets. Among conflicting alternatives (a jog to
raise  a  rising  net  may  block  a  jog  to  lower  a  falling  net)  the  router  chooses  the 
one  that  does  the  best  job  of  simplifying  the  remaining  overall  problem. 

4. Extending the Greedy Router

In  modifying  the  greedy  router  to  avoid  obstacles  we  had  to  solve  a  number 
of  problems.  The  result  was  an  augmented  set  of  rules  for  placing  horizontal  and 
vertical  wiring.  In  the  following  discussion,  an  area  with  a  single  layer  obstacle  is 
called  an  obstructed  area.  The  Magic  router  river-routes  through  obstructed 
areas.  An  area  is  blocked  if  it  contains  a  double  layer  obstacle.  No  routing  may 
pass  through  blocked  areas. 
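
The distinction between obstructed and blocked areas can be stated compactly. The sketch below assumes two routing layers, named here "metal" and "poly", and a hypothetical function `classify` that receives the layers occupied by obstacles at one grid point.

```python
def classify(obstacle_layers):
    """Classify a grid point by the routing layers that obstacles occupy there."""
    layers = set(obstacle_layers)
    if not layers:
        return "free"
    if len(layers) == 1:
        return "obstructed"   # single-layer obstacle: may be river-routed through
    return "blocked"          # double-layer obstacle: no routing may pass

examples = [classify([]), classify(["metal"]), classify(["metal", "poly"])]
```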

As  it  scans  a  channel  from  left  to  right,  the  greedy  router  expects  that  it  can 
always  extend  a  track  into  the  next  column  if  necessary.  The  router  must  avoid 
extending  tracks  into  blocked  areas  (Figure  4). 

We  solve  this  problem  by  anticipating  upcoming  obstacles  and  attempting  to 
jog  nets  out  of  their  way.  We  do  this  by  identifying  areas  near  obstacles;  these 
areas  are  called  obstacle  thresholds.  A  preprocessing  step  searches  the  routing 
area,  marking  obstacle  thresholds.  Tracks  extending  into  these  marked  areas 
make  vacating  jogs  to  tracks  outside  these  areas. 

Figure 4. Tracks cannot extend into blocked areas (drawn in dotted lines). Note
that two adjacent areas of different layers (B) form blocks because there is no place to
put contacts to bridge from one area to the other.

Another important issue is the tradeoff between horizontal and vertical wiring.
Magic has to decide whether to route horizontal wires or vertical wires over
single-layer obstacles. It cannot do both, since an obstacle and a wire
crossing it block both routing layers. A thin vertical wire should be bridged
horizontally by tracks. Likewise, a thin horizontal wire should be bridged vertically
by columns. Intermediate cases are harder to classify (Figure 5).


Figure 5. Thin width vertical wires should be bridged horizontally by tracks (a).
Thin width horizontal wires should be bridged vertically by columns (b). Intermediate
cases are harder to classify.

We solve this problem by always giving priority to horizontal wiring. If vertical
wiring is not done in the current column it may be done in some later
column. Horizontal wiring is more important: if the router needs to extend
tracks but cannot, it fails.

Although  horizontal  wiring  gets  priority  over  vertical  wiring,  we  attempt  to 
avoid  extending  tracks  into  large  single  layer  obstacles.  When  tracks  do  extend 
into  single  layer  obstacles  the  Magic  router  tries  to  jog  them  out  of  these  areas, 
into  unobstructed  tracks.  It  is  important  to  do  this  because  a  single  track  running 
through  an  obstructed  area  blocks  all  columns  that  might  cross  the  obstructed 
area (Figure 6).

Figure 6. Nets avoid obstructed tracks wherever possible. Failure to do so may
create blocked areas. Since net 2 is in an obstructed area, net 3 is forced to make a
long detour.

The  greedy  router  assumes  that  it  can  make  vertical  column  wiring  anywhere 
the  channel  is  not  blocked  by  vertical  wiring  it  previously  placed.  The  Magic 
router  has  to  know  not  only  when  to  place  vertical  column  wiring,  but  also  how  to 
do  this.  It  has  to  know  when  areas  are  blocked,  and  when  to  place  contacts  to 
switch  layers. 

Given our wiring model, contact placement is simple. If a contact needs to
be placed to allow a layer switch, there is only one place where that contact can
go: immediately adjacent to the obstacle. For vertical wiring, contacts may be
placed immediately above or below the obstacle. For horizontal wiring, the locations
are immediately to the left and right of the obstacle.

Our wiring model allows horizontal and vertical wiring in either layer; however,
only one layer of horizontal wiring and one layer of vertical wiring are
allowed at any point. There is a preferred layer in each direction: horizontal
tracks and vertical column wires may run in the opposite layer only to bridge an
obstacle. Since poly is the preferred vertical layer, a vertical run may bridge a
metal obstacle without placing contacts, but contacts need to be placed to bridge
a poly obstacle. If the track immediately above the poly obstacle is vacant, then
the contact can be placed. If the track is occupied by horizontal wiring, the preferred
layer policy says that it must be in metal. The metal/poly boundary
blocks the vertical run, since there is no space to bridge the metal track in poly
and place a contact before running over the poly obstacle (Figure 7).
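
The bridging decision for a vertical run can be sketched as follows. The function `can_bridge_vertically` is hypothetical, and it assumes that a run crossing the whole obstacle needs a contact on both sides of a poly obstacle, which the text states explicitly only for the track above.

```python
def can_bridge_vertically(obstacle_layer, above_vacant, below_vacant):
    """Can a vertical run in the preferred layer (poly) cross this obstacle?

    above_vacant / below_vacant say whether the tracks immediately above and
    below the obstacle are free to hold a contact.
    """
    if obstacle_layer == "metal":
        return True                            # poly crosses metal without contacts
    if obstacle_layer == "poly":
        return above_vacant and below_vacant   # contacts needed to switch to metal
    raise ValueError("unknown layer: " + obstacle_layer)

metal_case = can_bridge_vertically("metal", above_vacant=False, below_vacant=False)
poly_clear = can_bridge_vertically("poly", above_vacant=True, below_vacant=True)
poly_blocked = can_bridge_vertically("poly", above_vacant=False, below_vacant=True)
```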

Figure 7. The outlined areas (A) above and below the obstacle (B) are reserved for
column contacts necessary if the obstacle is to be bridged vertically. The router tries
to keep the areas clear of wiring. Note that the horizontal metal run prevents both
the poly (2) and the metal (3) vertical runs from bridging the obstacle, because there is
no room to place contacts.

The greedy router assumes that channels can be arbitrarily expanded and
that terminals on the left and right edges of the channel can “float” up and down
as long as their relative positions remain the same. Tracks may be inserted wherever
the router gets “stuck”. The Magic router assumes that channels have a
fixed number of tracks and that terminals have fixed positions on the edges of the
channels. Accordingly, the Magic router omits the greedy router’s channel widening
step, reporting failure if a net cannot be brought into the channel from
some top or bottom pin.

5.  Routing  Switchboxes 

The greedy channel router handles pins on at most the top, left, and bottom
sides of a channel. To make it a switchbox router, the Magic router contains
additional rules to make connections on the right edge of the channel. Furthermore,
the Magic router removes the assumption that nets have at most one pin on
each end of the channel.

The Magic router deals with switchbox connections by introducing the notion
of reserved tracks. A track is reserved if it is needed by some net to make a connection
on the right edge of the channel. When approaching the end of the channel,
the router makes vacating jogs to clear reserved tracks and then jogs the
appropriate nets into these tracks (Figure 8). Additionally, after nets with only
one right edge pin have made their last top and bottom pin connections, their
right edge tracks become reserved, other nets vacate these tracks, and the router
tries to jog nets into their final tracks. Vacating reserved tracks uses the same
mechanism provided to vacate obstructed tracks.

Figure 8. The outlined areas are reserved for nets making connections at the end of
the channel. Any other nets entering these areas make vacating jogs, allowing the
required nets to occupy the tracks.

If a net has more than one pin on the right edge of the channel, the router
needs to split the net to connect to them. Split nets occupy tracks that could otherwise
be used to help route the channel. Therefore splitting to make multiple
end connections is only done when the router gets close to the end of the channel.
“Close” is a parameter the user sets to control net splitting. A typical value is two
columns.


Figure  9.  As  the  router  approaches  the  end  of  the  channel,  nets  with  all  of  their  pins 
on  the  right  edge  of  the  channel  require  tracks  to  be  assigned  to  the  nets.  This  is  done 
if  at  least  two  tracks  can  be  allocated  and  joined  with  vertical  wiring. 

Nets with all of their pins on the right edge of the channel are another
complication. As the router nears the right edge of the channel it has to decide
when to first assign tracks to these right edge nets. Since there are no connections
to previous pins, a right edge net is introduced into the channel only if it can be
assigned to at least two tracks that can be joined by vertical wiring (Figure 9).
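
One way to test this condition is sketched below. It assumes a column is a list of tracks (net number or `None`) and that vertical wiring already placed in the column is recorded as inclusive (low, high) track spans; both the representation and the name `joinable_pair` are assumptions. Checking only neighbouring vacant tracks is sufficient, because any joinable pair of vacant tracks contains a joinable neighbouring pair.

```python
def joinable_pair(tracks, vertical_spans):
    """Return two vacant tracks that a vertical wire can join, or None.

    tracks: list with a net number per track, or None if the track is vacant.
    vertical_spans: inclusive (low, high) track ranges already used by
    vertical wiring in this column.
    """
    vacant = [i for i, net in enumerate(tracks) if net is None]
    for a, b in zip(vacant, vacant[1:]):
        # The new vertical wire covers tracks a..b and must miss every span.
        if all(high < a or low > b for low, high in vertical_spans):
            return a, b
    return None

tracks = [None, 4, None, None, 6]
found = joinable_pair(tracks, vertical_spans=[(0, 1)])
not_found = joinable_pair(tracks, vertical_spans=[(0, 3)])
```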

We carry this one step further. Groups of two or more tracks for a particular
right edge net may be introduced into the channel, even if the groups themselves
cannot immediately be joined. The task of joining these groups is easier,
since the top track of one group need only be connected to the bottom track of a
net’s higher group.

6. The Magic Routing Algorithm

The Magic router operates in three phases. It begins by making a pre-routing
scan of the routing area, identifying obstacle thresholds. After identifying
obstacle thresholds, the router extends nets from left edge pins into the routing
area and routes it using the column-by-column scan. After routing the channel,
the Magic router invokes a post-processing step to maximize metal and reduce
vias.

6.1. Finding Obstacle Thresholds

Obstacle thresholds are generated for all multi-layer obstacles and some single-layer
obstacles. Multi-layer obstacles such as contacts, crossings, and
poly/metal edges must always be avoided as tracks extend from left to right, since
it is not possible to bridge these obstacles in any layer. Single-layer obstacles
extending horizontally for more than one column’s width also generate threshold
areas. Single-layer obstacles extending horizontally for only one column’s width
do not generate thresholds since the vertical wiring gained in the obstructed area
is offset by the vertical wiring wasted in jogging around the obstacle.

Depending on the height of the obstacle, many nets may have to be jogged
around it. Not all nets can make vacating jogs in the same column because the
vertical wiring for one vacating jog blocks another net from making its vacating
jog. On the other hand, vacating tracks long before they near obstacles wastes
channel routing area. In recognition of this, the Magic router makes vacating jogs
around an obstacle depending on how far away and how high the obstacle is.
Higher obstacles, which block more tracks, cause nets to start vacating jogs
farther away, while shorter obstacles can be approached more closely before
vacating jogs commence. The width of the threshold is the product of a parameter,
the obstacle threshold constant, and the height of the obstacle. This parameter
allows some control over how soon the router attempts to vacate obstructed
tracks. A typical value for this parameter is 1.


Figure 10. Taller obstacles may require more nets to vacate their thresholds; therefore
taller regions have wider thresholds.

The  obstacle  threshold  also  extends  one  track  above  and  below  the  obstacle. 
Nets  do  not  get  assigned  to  these  tracks  unless  no  other  track  is  free.  This  allows 
contacts  to  be  placed  if  vertical  wiring  has  to  switch  layers  to  bridge  the  obstacle 
(Figure  7). 
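
The threshold geometry can be sketched as follows, assuming an obstacle is given in inclusive grid coordinates and that the threshold lies in the columns to the left of the obstacle, since the scan runs from left to right. The function name and the returned layout are hypothetical; the track range is not clipped to the channel.

```python
def obstacle_threshold(obstacle, threshold_constant=1):
    """obstacle = (left_col, right_col, low_track, high_track), inclusive.

    Returns the inclusive column range in which nets start vacating and the
    inclusive track range covered, including one spare track on each side
    that is kept free for contacts.
    """
    left_col, right_col, low_track, high_track = obstacle
    height = high_track - low_track + 1       # number of tracks the obstacle blocks
    width = threshold_constant * height       # threshold width in columns
    return {
        "columns": (max(0, left_col - width), left_col - 1),
        "tracks": (low_track - 1, high_track + 1),
    }

short = obstacle_threshold((10, 12, 3, 3))    # one track high
tall = obstacle_threshold((10, 12, 3, 5))     # three tracks high
```

With the typical constant of 1, the one-track obstacle gets a threshold one column wide and the three-track obstacle a threshold three columns wide.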

6.2.  Wiring  Rules 

This  section  presents  the  set  of  rules  the  Magic  router  uses  to  control  the 
placement  of  contacts  and  vertical  jogs.  The  following  discussion  omits  details 
that are identical in the greedy router. The rules are:

a. Place Track Contacts: As the first step in wiring a column, place a contact
in each unobstructed track, if either the next column or the previous column
has an obstruction in the preferred horizontal track layer. The contact
serves one of three purposes: (a) it switches the net from the preferred
horizontal track layer (metal) to the alternate layer (poly) when the net
enters a river-routed region; (b) it switches the net from the alternate layer
back to the preferred horizontal layer when the net leaves a river-routed
region; or (c) it switches the track to the preferred vertical layer in preparation
for jogging the net to another track.

b. Make Minimal Top and Bottom Connections: Do not bring a net into an
unobstructed track that is blocked in the next column. This step may bring
a net into an obstructed track. If this occurs, step (f) will attempt to jog the
net to an unobstructed track. Report failure if some net could not be
brought into the channel.

c.  Collapse  Split  Nets. 

d. Reduce the Range of Tracks Assigned to Split Nets: Do not move a net from
a free track to a track that needs to be vacated.

e.  Raise  Rising  Nets  and  Lower  Falling  Nets:  Do  not  jog  from  a  free  track  to 
one  that  needs  to  be  vacated. 

f.  Vacate  Obstructed  Tracks:  Identify  tracks  from  which  nets  should  be 
vacated.  These  are  tracks  which  are  either  in  the  threshold  of  an  obstacle  or 
are  reserved  to  make  some  end  connection.  Try  to  vacate  to  the  nearest 
empty,  unobstructed  track.  Do  not  vacate  to  another  obstructed  or  reserved 
track unless the source track is blocked (i.e., runs into a multi-layer obstacle)
and  the  destination  track  is  not  blocked.  Give  preference  to  vacating  jogs 
that  move  rising  and  falling  nets  closer  to  their  next  pin. 

g. Split Nets to Make Multiple End Connections: If within channel end constant
columns of the end of the channel, attempt to split nets to make multiple
connections at the end of the channel. This is the opposite of the collapsing
step c above. The best pattern is that which splits the most tracks.

h. Extend Active Tracks to the Next Column: Report an error if some track is
prevented from extending into the next column by the presence of a multi-layer
obstacle that could not be avoided.
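
As an illustration of rule f, the search for a destination track might look like the sketch below. The name `vacate_target`, the list model of a column, and the tie-breaking choice are assumptions; the sketch also ignores vertical wiring conflicts, the exception for blocked source tracks, and the preference for jogs that help rising and falling nets.

```python
def vacate_target(source, tracks, avoid):
    """Nearest track to `source` that is empty and not obstructed or reserved.

    tracks: list with a net number per track, or None if the track is empty.
    avoid: set of track indices that are obstructed or reserved.
    Ties are broken toward the lower index, an arbitrary choice.
    """
    candidates = [t for t, net in enumerate(tracks) if net is None and t not in avoid]
    if not candidates:
        return None
    return min(candidates, key=lambda t: (abs(t - source), t))

tracks = [None, None, 7, None, 9]
# Net 7 sits in track 2, inside an obstacle threshold covering tracks 2 and 3.
target = vacate_target(2, tracks, avoid={2, 3})
```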

6.3. Metal Maximization

Figure 11. A postprocessing step maximizes metal. This may delete or move vias.
It may also introduce vias.


After the subchannels are routed, the Magic router concludes with a metal
maximization step (Figure 11). Since the router already routes metal horizontally
wherever possible, this step replaces vertical wiring in polysilicon with vertical
wiring in metal, subject to constraints imposed by obstacles in the channel.
Vias are deleted wherever they become unnecessary.


7.  Channel  Splitting 

The  Magic  router  also  extends  the  greedy  router  by  including  a  channel  split¬ 
ting  feature.  It  splits  a  channel  in  two  at  a  point  of  maximum  density,  assigns 
tracks  to  nets  crossing  the  split,  then  routes  both  subchannels  outwards  from  the 
column  of  the  split.  The  intent  of  channel  splitting  is  to  improve  the  routability 
of  the  two  resulting  subproblems  by  (1)  assigning  tracks  to  the  nets  crossing  the 
split  to  remove  conflicts  between  vertical  wiring,  and  (2)  removing  split  nets  at 
the  column  where  the  channel  is  divided,  to  guarantee  that  there  are  enough 
available tracks to accommodate the nets that must cross this column. Channel
splitting is done if the length in columns of each of the resulting subproblems
is greater than or equal to the parameter minimum channel size, and if the
density of the routing problem is close to the size of the channel. If the
channel cannot be split, then the router routes it from left to right or from
right to left, at the discretion of the user.

Figure 12. To increase the routability of the two subchannels the router assigns
tracks to nets crossing the split. Nets are ordered based on their rising/falling status
and the distance to their closest left and right pins.

Channel  splitting  is  not  recursive  —  it  is  done  at  most  once.  The  idea  is  to 
route  away  from  the  point  of  maximum  density.  Splitting  each  subchannel  at  its 
point  of  maximum  density  would  result  in  subchannels  routing  from  one  highly 
constrained  region  to  another. 
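
The split decision described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the router's C code:
a net is modeled as the list of columns holding its pins, the function names
are hypothetical, and the `slack` argument is an assumed stand-in for "density
close to the size of the channel," which the text does not quantify.

```python
def column_density(nets, n_columns):
    """Count, for each column, the nets whose pin span covers that column."""
    density = [0] * n_columns
    for pins in nets.values():
        for col in range(min(pins), max(pins) + 1):
            density[col] += 1
    return density


def choose_split(nets, n_columns, n_tracks, min_channel_size, slack=1):
    """Return the split column, or None if the channel should not be split.

    slack is a hypothetical threshold: the channel is split only when the
    peak density is within slack tracks of the channel size.
    """
    density = column_density(nets, n_columns)
    peak = max(density)
    if peak < n_tracks - slack:
        return None                  # channel is not congested enough
    col = density.index(peak)        # first column of maximum density
    left_len, right_len = col, n_columns - col
    if left_len < min_channel_size or right_len < min_channel_size:
        return None                  # a subchannel would be too short
    return col                       # split once; subchannels are not re-split
```

When `choose_split` returns None, the caller would fall back to routing the
whole channel in the direction the user selected.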

After  deciding  where  to  split  the  channel,  the  Magic  router  assigns  tracks  to 
the  nets  crossing  the  split.  The  ranking  procedure  assigns  each  net  a  ranking 
number  which  is  the  average  of  the  distance  from  the  center  track  of  the  channel 
to  the  net’s  target  tracks  in  the  left  and  right  subchannels.  The  top  tracks  go  to 
nets  which  rise  to  pins  on  the  top  edge  of  both  subchannels.  The  bottom  tracks 
are  assigned  to  nets  which  fall  to  pins  on  the  bottom  edge  of  both  subchannels. 
All  other  nets,  including  those  rising  or  falling  an  intermediate  distance,  and  those 
steady  in  both  subchannels,  get  distributed  between  the  first  two  groups. 

Another  discriminator  is  used  among  nets  rising  to  the  top  or  falling  to  the 
bottom  of  both  subchannels.  A  net  a  ranks  above  another  net  b  if  both  a’s 
nearest  left  pin  and  its  nearest  right  pin  are  closer  to  the  split  column  than  b's 
corresponding pins. If the distances overlap (i.e., a's left pin is closer than b's, and
b's right pin is closer than a's), then the net with the smaller sum of distances is
placed  above  the  other.  A  similar  procedure  is  used  for  falling  nets.  The  intent  is 
to  order  the  nets  to  eliminate  crossings  wherever  possible.  If  nets  must  cross,  this 
procedure  favors  the  net  traveling  the  shorter  distance. 
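
The comparison between two rising nets can be written as a small predicate.
The sketch below is an assumed formulation, not the router's code: each net is
reduced to a hypothetical (left distance, right distance) pair measured in
columns from the split column, and ties are left unresolved because the text
does not say how they are broken.

```python
def ranks_above(a, b):
    """Return True if net a should be placed above net b.

    Each net is a (left_distance, right_distance) pair: the distances in
    columns from the split column to its nearest left and right pins.
    """
    a_left, a_right = a
    b_left, b_right = b
    if a_left < b_left and a_right < b_right:
        return True            # a is closer on both sides
    if b_left < a_left and b_right < a_right:
        return False           # b is closer on both sides
    # Distances overlap: the smaller total distance wins.
    return a_left + a_right < b_left + b_right


def order_rising_nets(nets):
    """Order net names from the top track downward; nets maps name -> pair."""
    return sorted(nets, key=lambda name: sum(nets[name]))
```

Sorting by the sum of distances agrees with `ranks_above`, because a net that
is closer on both sides necessarily has the smaller sum.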

8. Implementation and Performance

For channels without obstacles the Magic router produces results similar to
those produced by other good channel routers such as the hierarchical router
[BuP], the greedy router [RiF], and Algorithm #2 [YoK]. In spite of omitting the
track insertion step from the greedy algorithm, it routes Deutsch's difficult
example in the same number of tracks as the greedy router. The results are
summarized in Table 1.


Router                  Tracks  Vias  Wire Length  Time (sec)  Machine
Magic (no obstacles)    20      376   4099         1.5         DEC VAX 11/780
Magic (with obstacles)  20      376   4099         3.0         DEC VAX 11/780
Algorithm #2            20      -                  2.1         DEC VAX 11/780
Greedy                  20      347   4150         7.93        DEC KA-10
Hierarchical            19      270   3983         24          IBM 370/3033

Table 1. Router Results for Deutsch's Difficult Example


Most  of  the  numbers  in  Table  1  were  taken  from  [BuP].  The  first  table  entry 
refers  to  our  implementation  of  a  modified  greedy  switchbox  router  before  obsta¬ 
cle  avoidance  was  added.  The  reported  number  of  vias  for  the  Magic  router  does 
not  show  the  results  of  metal  maximization. 

The  table  shows  that  the  Magic  router  is  competitive  with  other  channel 
routers  on  conventional  routing  problems.  It  produces  nearly  optimal  solutions 
quickly,  which  may  be  more  valuable  in  practice  than  programs  such  as  the 
Hierarchical  router  which  produce  slightly  better  results  after  significantly  greater 
computation.  Adding  obstacle  avoidance  nearly  doubled  the  running  time  of  our 
router. 

Our figures provide a good comparison between Yoshimura and Kuh's Algorithm
#2 and Rivest and Fiduccia's greedy router. Rivest and Fiduccia's router
was implemented in LISP on a KA-10. The Magic router without obstacle
avoidance (which is almost identical to the greedy router) is implemented in the C
programming language. Algorithm #2 is implemented in FORTRAN. Both the
Magic router (without obstacle avoidance) and Algorithm #2 run on VAX
11/780s running Berkeley Unix. The early version of our router runs faster than
the already fast Algorithm #2, and produces a result using the same number of
tracks.

Experience  with  channel  splitting  has  so  far  been  disappointing.  It  has 
turned  out  to  be  useful  mostly  for  assigning  crossings  in  river  routed  regions.  In 
other  cases  splitting  the  channel  typically  increases  the  number  of  tracks  required 
to  route  the  channel.  Better  rules  for  ordering  the  nets  crossing  the  boundary 
between  the  subchannels  might  change  this.  Another  idea  would  be  to  use  dif¬ 
ferent  criteria  to  decide  where  to  split  the  channel. 


Figure  13.  The  Magic  router  river-routes  in  areas  completely  blocked  in  a  single 
layer. 

As  an  example  of  the  range  of  problems  handled  by  the  Magic  router,  Figure 
13  shows  a  channel  completely  covered  with  metal.  Our  router  does  a  reasonable 
job  of  routing  this  problem. 

Postprocessing to increase metal and remove vias appears to significantly
improve the quality of the routing.

9. Conclusions

Our  obstacle  avoiding  channel  router  adds  flexibility  to  our  design  environ¬ 
ment.  It  allows  designers  to  route  critical  signals  by  hand  or  with  separate  rout¬ 
ing  steps.  After  critical  signals  are  routed,  the  router  makes  the  remaining  con¬ 
nections. 

The Magic channel router provides this obstacle-avoiding capability, while
also considering tradeoffs and interactions between nets. It accomplishes this
using a rule-based, column-sweep routing algorithm which is simple, flexible, and
fast. The simplicity of this approach makes it an attractive vehicle for further
experimentation.


ABSTRACT

The  Magic  VLSI  layout  editor  contains  an  incremental 
design-rule  checker.  When  the  circuit  is  changed,  only 
the  modified  areas  are  rechecked.  The  checker  runs  con¬ 
tinuously  in  background  to  keep  information  about 
design-rule  violations  up-to-date.  This  paper  describes 
the  basic  rule  checker,  which  operates  on  edges  in  the 
layout,  and  the  techniques  used  to  perform  incremental 
checking  on  hierarchical  designs. 

Keywords and Phrases: design-rule checking, interactive
layout editor


1.  Introduction 

Almost  all  existing  design-rule  checking  (DRC)  programs 
are batch oriented [1] [2]. They read in a complete circuit
layout and check the entire design. If the circuit is changed, the
only  way  to  find  out  whether  design  rules  have  been  violated  is 
to  recheck  the  entire  design,  no  matter  how  small  the  change 
or  how  large  the  design.  For  chips  with  tens  of  thousands  of 
transistors,  a  batch  DRC  run  may  require  many  hours  of  com¬ 
puter  time. 

This  paper  describes  a  different  approach  to  design-rule 
checking.  As  part  of  the  Magic  VLSI  layout  editor  [3],  we  have 
built a checker that operates incrementally. When the layout
is  modified,  Magic  records  which  areas  have  changed  and 
rechecks  only  those  areas.  While  the  user  continues  editing, 
the  checker  runs  in  background  and  highlights  errors  as  it  finds 
them.  There  is  no  set-up  time  because  the  checker  works  from 
the  same  data  structure  used  to  represent  the  layout.  Since 
most  changes  made  with  the  interactive  editor  are  small  and 
the  checker  is  fast,  it  can  usually  display  errors  instantly. 

The  user’s  view  of  design-rule  checking  is  a  simple  one. 
As  he  edits  the  circuit,  small  white  dots  appear  over  areas  that 
contain  layout  errors.  As  soon  as  the  errors  are  fixed,  the 
white dots go away. Error information is stored with the
design and will reappear during the next editing session if the
violation has not been fixed. This information is always kept
up-to-date, so there is never any need to run a batch checker.

In the next section, we describe Magic's internal representation
for a layout and explain how particular features contribute
to fast incremental checking. Section 3 describes how the
basic checker works from edges in the layout and how design
rules are specified. Section 4 shows how we use the basic
checker for incremental checking of individual cells, and
Section 5 describes how hierarchical designs are handled. Section
6 gives measurements of the checker's speed.

2.  Representation  of  a  Layout 

In Magic, a layout is represented as a hierarchical collection
of cells. Each cell contains mask information plus pointers
to subcells. For now, we will consider only a single cell at a
time (Section 5 generalizes the solution to handle hierarchical
designs).

Magic represents the mask layers of a cell with rectangular
tiles, which means that it handles only Manhattan
geometries. Each tile indicates the type of mask layer it
represents. Tiles are connected to form planes by a technique
called corner-stitching [4], illustrated in Figure 1. The tiles in
a plane are non-overlapping and cover it completely. Empty
areas are covered with tiles of type "space."

Each  cell  contains  several  planes  of  mask  information. 
Mask  types  that  interact  (such  as  polysilicon  and  diffusion)  are 
stored  together  in  the  same  plane,  while  those  that  do  not 
interact  (such  as  polysilicon  and  metal)  are  stored  in  different 
planes. Contacts between mask types on different planes are
represented in both of them. Our nMOS process has two
planes: one for metal and one for polysilicon, diffusion, and
transistors.


Figure 1. An example of a corner-stitched plane. Each
plane contains tiles of different types that cover the entire
area of the plane (space tiles are used where there is no
mask material). Each tile contains four pointers that link
it to neighboring tiles at its corners. The pointers make it
easy to find all the material in a given area.
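
A minimal Python sketch of a corner-stitched tile and a point search may help
make the structure concrete. It is an illustration only: the field names
(`lb`, `bl`, `rt`, `tr`) and the choice of which neighbor each pointer selects
follow the usual corner-stitching convention rather than anything stated in
this paper, the plane has no boundary tiles, and the search assumes the point
lies inside the plane.

```python
class Tile:
    """A tile covering [x0, x1) x [y0, y1) with four corner stitches."""

    def __init__(self, x0, y0, x1, y1, kind):
        self.x0, self.y0, self.x1, self.y1, self.kind = x0, y0, x1, y1, kind
        self.lb = self.bl = None   # lower-left corner: tile below, tile to the left
        self.rt = self.tr = None   # upper-right corner: tile above, tile to the right


def find_tile(start, x, y):
    """Walk the stitches from start to the tile containing the point (x, y)."""
    t = start
    while not (t.x0 <= x < t.x1 and t.y0 <= y < t.y1):
        if y < t.y0:
            t = t.lb
        elif y >= t.y1:
            t = t.rt
        elif x < t.x0:
            t = t.bl
        else:
            t = t.tr
    return t


# A 10 x 10 plane holding one polysilicon tile surrounded by space tiles.
bottom = Tile(0, 0, 10, 4, "space")
left = Tile(0, 4, 3, 7, "space")
poly = Tile(3, 4, 6, 7, "poly")
right = Tile(6, 4, 10, 7, "space")
top = Tile(0, 7, 10, 10, "space")
bottom.rt = right
left.lb, left.rt, left.tr = bottom, top, poly
poly.lb, poly.bl, poly.rt, poly.tr = bottom, left, top, right
right.lb, right.bl, right.rt = bottom, poly, top
top.lb = left
```

Starting from any tile, `find_tile(bottom, 4, 5)` walks up and then left to
reach the polysilicon tile without examining tiles far from the point.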


Instead of working directly with physical mask layers,
Magic uses abstract layers to represent structures such as
transistors and contacts. The abstract layers appear in the
database as tiles with special types. For example, instead of
representing an enhancement transistor as a polysilicon tile
over a diffusion tile, it is represented with a tile of type
"enhancement transistor." A more complete explanation of the
abstract layers is given in [3]. What matters here is that all
the interesting features are represented explicitly: there is no
need to cross-register diffusion and polysilicon to discover the
transistors.

The  design-rule  checker  takes  advantage  of  Magic's  data¬ 
base  in  three  ways.  First,  the  corner-stitched  tiles  allow  DRC 
to  find  material  in  a  given  area  very  quickly.  Second,  division 
of  mask  information  into  planes  allows  the  checker  to  work 
with  one  plane  at  a  time,  ignoring  irrelevant  geometry  on  other 
planes.  Third,  there  is  no  need  to  extract  features  by  register¬ 
ing  layers:  the  abstract  layers  represent  the  important 
features  explicitly.  Because  of  these  features,  there  is  no  need 
for  the  checker  to  manage  a  separate  structure  of  its  own:  it 
works  directly  from  the  layout  database. 

3. The Basic Checker

This section describes the basic design-rule checking paradigm
used to validate an area of a single corner-stitched plane.
Later sections show how this basic checker is used to perform
incremental checks on a single cell, and then on a hierarchy of
cells.

3.1. Edge-based Rules

Magic's  design  rules  are  based  on  edges  between  tiles. 
Each  rule  can  be  applied  in  any  of  four  directions,  as  shown  in 
Figure  2.  The  rule  table  contains  a  separate  list  of  rules  for 
each  possible  combination  of  materials  on  the  two  sides  of  an 
edge.  In  its  simplest  form,  a  rule  specifies  a  distance  and  a  set 
of  mask  types:  only  the  given  types  are  permitted  within  that 
distance on type2's side of the edge. This area is referred to as
the constraint region.


[Figure 2 label: Only certain tile types are allowed in the dashed
constraint regions.]

Figure 2. Design rules are applied at the edges between
tiles in the same plane. A rule is specified in terms of type1
and type2, the materials on either side of the edge. Each
rule may be applied in any of four directions, as shown by
the arrows. The simplest rules require that only certain
mask types can appear within distance d on type2's side of
the edge.


[Figure 3, panels (a) and (b). Labels: "tile types allowed: anything
but poly", "poly", "constraint regions".]

Figure 3. If only the simple rules from Figure 2 are used,
errors may go unnoticed in corner regions. For example,
the polysilicon spacing rule in (a) will fail to detect the
error in (b).


Unfortunately,  this  simple  scheme  will  miss  errors  in 
corner  regions,  such  as  the  case  shown  in  Figure  3.  To  elim¬ 
inate  these  problems,  the  full  rule  format  allows  the  constraint 
region  to  be  extended  past  the  ends  of  the  edge  under  some  cir¬ 
cumstances.  See  Figure  4  for  an  illustration  of  the  corner  rules 
and  how  they  work.  Table  1  gives  a  complete  summary  of  the 
information  in  each  design  rule. 


Figure 4. The complete design rule format is illustrated
in (a). Whenever an edge has type1 on its left side and
type2 on its right side, the area A is checked to be sure that
only types allowed are present. If the material just above
and to the left of the edge is one of corner types, then area
B is also checked to be sure that it contains only types
allowed. A similar corner check is made at the bottom of the
edge. Figure (b) shows a polysilicon spacing rule, (c) shows
a situation where corner extension is performed on both
ends of the edge, and (d) shows a situation where corner
extension is made only at the bottom of the edge.


Parameter         Meaning
type1             Material on first side of edge.
type2             Material on second side of edge.
d                 Distance to check on second side of edge.
layers allowed    List of layers that are permitted within d units
                  on second side of edge.
corner types      List of layers that cause corner extension.
corner extension  Amount to extend constraint area when corner
                  type matches.

Table 1. The parts of an edge-based rule.
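
The fields of Table 1 map naturally onto a small record. The Python sketch
below is a hedged illustration, not Magic's data structure: the class name,
the rectangle convention (x0, y0, x1, y1), and the example values are
assumptions. It computes the constraint region for a left-to-right
application at a vertical edge, including the corner extension described in
Figure 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeRule:
    type1: str                 # material on first side of edge
    type2: str                 # material on second side of edge
    distance: int              # d: distance to check on second side
    allowed: frozenset         # layers permitted inside the constraint region
    corner_types: frozenset    # layers that trigger corner extension
    corner_extension: int      # amount to extend the region at a corner


def constraint_region(rule, x, y_bottom, y_top, above_left, below_left):
    """Region checked to the right of a vertical edge at x.

    above_left and below_left are the materials just beyond the top and
    bottom ends of the edge on type1's side (an assumed encoding).
    """
    y0, y1 = y_bottom, y_top
    if above_left in rule.corner_types:
        y1 += rule.corner_extension
    if below_left in rule.corner_types:
        y0 -= rule.corner_extension
    return (x, y0, x + rule.distance, y1)


def violations(rule, kinds_in_region):
    """Return the materials found in the region that the rule does not allow."""
    return sorted(set(kinds_in_region) - rule.allowed)
```

For a hypothetical polysilicon spacing rule with d = 2, the region to the
right of a poly edge grows upward by the corner extension only when the
material above and to the left of the edge is one of the corner types.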


3.2. Applying the Rules

To check a portion of a single plane, Magic must first find
all of the edges in that area. This is accomplished by searching
for all of the tiles in the area. The corner-stitched data
structure is well suited to searches of this sort; see [4]. For each
tile, the checker examines its left and bottom sides (the top
and right sides of the tile will be checked by the neighbors on
those sides). Since the tile may have neighbors of different
types on the same side, the checker searches through all the
neighbors to divide the side of the tile into edges with a single
material on each side.

To  process  an  edge,  the  mask  types  on  each  side  of  it  are 
used  to  index  into  the  rule  table  to  find  the  list  of  rules  for 
that  kind  of  edge.  Each  rule  in  the  list  is  checked,  and  error 
information  is  recorded  for  any  areas  where  the  constraints  are 
not  satisfied.  For  each  edge  there  are  two  rule  applications: 
left-to-right and right-to-left (for vertical edges) or bottom-to-top
and top-to-bottom (for horizontal edges). A different list
of  rules  is  applied  in  each  direction,  since  the  mask  types  are 
reversed. 
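
The two steps above (dividing a tile side into edges, then indexing the rule
table in both directions) can be sketched as follows. The tuple layouts and
function names are assumptions chosen for brevity, and the rule table is
modeled as a plain dictionary keyed by (type1, type2); Magic's actual table
layout is not described here.

```python
def left_side_edges(tile, left_neighbors):
    """Divide the left side of a tile into edges with one material per side.

    tile is (y0, y1, kind); each neighbor is (y0, y1, kind) for a tile that
    touches the left side. Returns (y_bottom, y_top, left_kind, right_kind).
    """
    y0, y1, kind = tile
    edges = []
    for n_y0, n_y1, n_kind in left_neighbors:
        lo, hi = max(y0, n_y0), min(y1, n_y1)   # clip neighbor to the tile side
        if lo < hi:
            edges.append((lo, hi, n_kind, kind))
    return edges


def rules_for_edge(rule_table, left_kind, right_kind):
    """Return the rule lists for the two applications at a vertical edge."""
    left_to_right = rule_table.get((left_kind, right_kind), [])
    right_to_left = rule_table.get((right_kind, left_kind), [])
    return left_to_right, right_to_left
```

The same pattern would apply to the bottom side of a tile, with x-ranges in
place of y-ranges and bottom-to-top and top-to-bottom rule lists.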

3.3.  Specifying  Design  Rules 

Design rules are specified in a technology file that contains
the rules and other technology-specific information.
When Magic starts executing, it reads this file and builds the
rule table. Initially we specified rules in the detailed form of
Table 1, with one line for each edge rule. This scheme proved
to be unworkable, because there were many rules and it
became difficult to convince ourselves that the rule set was
complete and correct.

In  order  to  simplify  the  process  of  creating  rule  sets, 
Magic  now  permits  rules  to  be  specified  with  high  level  macros 
for  width  and  spacing.  For  example,  the  macro 

spacing efet,dfet dmc,pmc 1

is expanded into several rules to verify that enhancement and
depletion transistors are always separated from diffusion-metal
contacts and poly-metal contacts by at least one unit. The
macro

width poly,pmc,buried,efet,dfet 2

is expanded into the set of edge rules needed to verify that any
region containing any of the five mask types listed is always at
least two units wide.
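
The paper does not give the exact expansion of a spacing macro, so the Python
sketch below shows only one plausible scheme, stated as an assumption: for
every edge leaving one of the two type sets, the other set is forbidden within
the given distance. Corner types and corner extension are omitted, and the
function names and tuple layout are hypothetical.

```python
def parse_macro(line):
    """Split a macro line into (keyword, list of type lists, distance)."""
    keyword, *type_lists, distance = line.split()
    return keyword, [names.split(",") for names in type_lists], int(distance)


def expand_spacing(set_a, set_b, distance, all_types):
    """Expand a spacing macro into (type1, type2, d, allowed) edge rules.

    Assumed scheme: at any edge from a type in one set to a type outside
    that set, nothing from the other set may appear within distance d.
    """
    rules = []
    for near, far in ((set_a, set_b), (set_b, set_a)):
        allowed = frozenset(all_types) - frozenset(far)
        for type1 in sorted(near):
            for type2 in sorted(set(all_types) - set(near)):
                rules.append((type1, type2, distance, allowed))
    return rules
```

Generating rules from both sets reflects the symmetry of spacing checks: a
violation between two materials is visible from an edge of either one.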


Most of the rules for our processes are simple width and
spacing checks, so these two macros considerably simplify the
writing of rule sets. Our nMOS rule set contains 8 width rules,
6 spacing rules, and 10 of the detailed edge rules for situations
that cannot be handled by the width and spacing rules (e.g.
transistor overhangs). Magic expands these 24 high-level rules
into 12? detailed edge rules. Our CMOS process requires 35
high-level rules that are expanded into 188 detailed edge rules.

The width and spacing macros make Magic's checker
more efficient because the width and spacing rules are
symmetric. If layers x and y are too close together, the violation
can be detected from either an edge of x or an edge of y. This
means that it is unnecessary to check the rules from both
edges. Magic takes advantage of this symmetry by checking
width and spacing rules in only two directions (left-to-right
and bottom-to-top). In addition, symmetric rules mean that
corner extension is only necessary on one end of each edge.
Since most of the detailed edge rules come from the width and
spacing macros, this speeds up the checking process by almost
a factor of two.

Magic's design-rule language has certain limitations. It
can only express constraints that depend on a limited amount
of local context. One example of a rule that depends on more
extensive context is a rule where the spacing between adjacent
parallel wires depends on their length. Another example is a
reflection rule where the minimum size of one material depends
on its proximity to another material. In the processes that we
use, complex rules such as these are replaced with more
conservative, but simpler, rules.

4. Continuous Design-Rule Checking

This section shows how the basic checker is used to provide
continuous incremental rule validation. As in the previous
section, we consider only single-cell designs here.

In  order  to  perform  DRC  incrementally,  Magic  maintains 
two  extra  kinds  of  information  with  each  cell,  stored  in  the 
same  form  as  mask  layers.  First,  Magic  keeps  information 
about  rule  violations  that  have  been  detected  but  haven't  been 
corrected.  The  violations  are  represented  by  error  tiles  that 
cover  the  areas  where  rule  constraints  are  not  satisfied.  The 
second kind of information consists of tiles describing the areas
of  the  circuit  that  need  to  be  reverified.  The  error  tiles  and 
the  reverify  tiles  are  stored  in  separate  corner-stitched  planes. 
Each  cell  contains  its  own  error  and  reverify  planes. 

When  a  designer  changes  a  cell,  Magic  creates  reverify 
tiles  that  cover  the  area  modified.  The  design-rule  checker 
runs  in  background  while  Magic  is  waiting  for  the  designer  to 
enter  the  next  command.  DRC  first  searches  for  reverify  tiles. 
Then it invokes the basic checker over the area covered by
each tile found. The basic checker reverifies the area on each
of  the  cell's  planes,  updates  error  tiles,  and  erases  the  reverify 
tile.  Changes  to  the  error  information  are  reflected  on  the 
graphics  screen. 

If the designer invokes a command while the checker is
running, the checker stops so that the command can be
processed without delay. After the command finishes, the checker
resumes by starting over on the area that it was working on
just before the interruption. Large reverify tiles are broken up
into small ones before checking, in order to reduce the amount
of work that might have to be repeated. When there are large
areas to be reverified, the checker works across the design in a
style like "Pac-Man," gobbling up reverify tiles and spitting
out error tiles.
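
The interruptible background loop can be sketched as below. This is a
simplified single-threaded model under stated assumptions: `check_area` and
`command_waiting` are hypothetical callbacks, the piece size is an arbitrary
parameter, and interruption is polled between pieces rather than in the middle
of a check.

```python
from collections import deque


def split_tile(rect, max_size):
    """Break a reverify rectangle (x0, y0, x1, y1) into small pieces."""
    x0, y0, x1, y1 = rect
    return [(x, y, min(x + max_size, x1), min(y + max_size, y1))
            for y in range(y0, y1, max_size)
            for x in range(x0, x1, max_size)]


def run_checker(pending, check_area, command_waiting):
    """Check queued pieces until done or until a command arrives.

    Returns True when every piece has been checked. A piece is removed
    only after its check completes, so an interrupted piece is redone.
    """
    while pending:
        if command_waiting():
            return False            # yield to the designer's command
        check_area(pending[0])      # reverify, update error tiles
        pending.popleft()           # erase the reverify tile
    return True
```

Because pieces are small, at most one piece of work is repeated after each
interruption.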

If  incremental  checking  is  done  carelessly,  errors  may  not 
be  detected  when  new  violations  are  introduced,  or  error  infor¬ 
mation  may  be  left  in  the  database  even  after  the  violations 
have been corrected. Figure 5 illustrates the problem and
Magic's  solution.  When  an  area  is  modified,  error  information 
may  be  affected  in  both  the  area  that  was  modified  and  in  the 
surrounding  area  (for  example,  material  in  area  A  may  be  too 
close  to  something  in  the  surrounding  area  B).  We  call  the 
surrounding  area  the  halo.  Its  width  is  equal  to  the  largest 
distance  in  any  design  rule.  Error  information  must  be  recom¬ 
puted  in  the  modified  area  and  its  halo.  However,  errors  in  the 
halo  don't  necessarily  involve  the  inner  modified  area.  They 
may  come  from  interactions  between  the  halo  and  a  second 
halo  outside  it.  To  regenerate  errors  in  the  first  halo  correctly, 
information  in  the  second  halo  must  be  considered. 

When  area  A  of  Figure  5  is  modified,  Magic  rechecks  it  by 
deleting  all  error  information  in  A  and  B.  The  checker  then 
generates  new  error  information  in  both  areas  by  invoking  the 
basic  checker  over  areas  A,  B  and  C.  Any  errors  found  during 
this  process  are  clipped  to  the  area  of  A  and  B,  so  that  error 
information  is  not  affected  outside  the  region  where  errors  were 
erased. 
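
The erase, recheck, and clip steps amount to simple rectangle arithmetic. The
sketch below assumes rectangles are (x0, y0, x1, y1) tuples and that
`erase_errors` and `basic_check` are supplied by the caller; these names are
illustrative and do not come from Magic's source.

```python
def bloat(rect, d):
    """Grow a rectangle by d on every side."""
    x0, y0, x1, y1 = rect
    return (x0 - d, y0 - d, x1 + d, y1 + d)


def clip(rect, bounds):
    """Intersect rect with bounds; return None if nothing remains."""
    x0, y0 = max(rect[0], bounds[0]), max(rect[1], bounds[1])
    x1, y1 = min(rect[2], bounds[2]), min(rect[3], bounds[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def recheck(modified, halo, erase_errors, basic_check):
    """Regenerate error information after area A (modified) changes."""
    erase_area = bloat(modified, halo)        # areas A and B
    check_area = bloat(modified, 2 * halo)    # areas A, B and C
    erase_errors(erase_area)
    new_errors = []
    for error in basic_check(check_area):
        clipped = clip(error, erase_area)     # keep only the part inside A and B
        if clipped is not None:
            new_errors.append(clipped)
    return new_errors
```

Errors that lie wholly in area C are discarded, since the error information
there was never erased.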

The  reverify  and  error  tiles  are  stored  with  cells  so  that 
they  are  not  lost  at  the  end  of  an  editing  session.  Normally, 
there  will  be  no  reverify  tiles  left  at  the  end  of  a  session,  but  if 
a  large  area  has  been  changed  recently,  it  is  possible  that  it 
won't have been reverified when the session ends. In this case,
the  reverify  tiles  are  written  to  disk  with  the  cell.  When  the 
cell  is  read  in  during  the  next  editing  session,  the  design-rule 
checker  will  notice  the  reverify  tiles  and  continue  the 
reverification process. The reverify and error tiles are identical
to the tiles used to represent mask layers, except that they are
not manipulated directly by the designer.

Figure 5. If area A is modified, the design-rule checker
erases existing error information in both A and B. Errors
in B could have come from information in A, B or C, so all
three areas must be checked to regenerate all of the errors.
The width of the halos B and C is equal to the largest
distance in any design rule.



5.  Hierarchical  Checking 

Most  of  the  layouts  created  with  Magic  consist  of 
hierarchical  cell  structures  rather  than  single  cells  (Figure  6). 
Each  cell  may  contain  subcells,  and  the  subcells  may  overlap 
other  subcells  or  mask  information  in  the  parent.  A  subcell 
may  appear  any  number  of  times  in  any  number  of  parents. 

In  hierarchical  designs,  errors  can  arise  in  any  of  three 
ways: 

a)  the  mask  information  of  an  individual  cell  may  be 
incorrect; 

b)  a  subcell  may  interact  incorrectly  with  another  subcell; 
and 

c)  a  subcell  may  interact  incorrectly  with  mask  information 
in  its  parents. 

Magic's  incremental  checker  includes  facilities  to  detect  all  of 
these errors. Overlapping subcells are no more difficult to handle
than subcells that merely abut, since interaction errors are
possible  in  either  case. 


Figure 6. Circuits are defined by cells arranged in a
hierarchy.  If  mask  information  is  changed  in  a  low-level 
cell,  Magic  checks  to  be  sure  that  the  cell  is  consistent  by 
itself  and  that  there  are  no  illegal  interactions  in  parents  or 
other  ancestors. 


5.1. Simple Checks and Interaction Checks

Two overall rules guide the hierarchical checker. First,
the mask information in every cell must satisfy the design rules
by itself, without consideration of subcells. Second, each cell
and its subcells must together satisfy all the design rules,
without consideration of how that cell is used in its parents. If
the layout is viewed as a tree structure, the first rule means
that each node of the tree must be consistent, and the second
rule means that each subtree must be consistent.

The overall rules result in two kinds of design-rule checking.
The first rule is verified by running the basic checker over
the planes containing mask information for each cell; this is
called a simple check. The second rule is verified with an
interaction check that considers interactions involving subcells.
Each cell uses separate planes to hold its mask information,
so interaction checks must combine information from
different planes.

To  make  an  interaction  check  on  an  area,  the  hierarchical 
structure is "flattened" to produce a new set of corner-stitched
planes  that  combines  all  the  information  from  all  cells  in  the 
area  to  be  checked.  This  includes  mask  information  from  the 
parent  cell,  plus  mask  information  from  subcells  and  sub¬ 
subcells,  and  so  on.  Once  all  the  mask  information  in  the  area 
has  been  collected  into  a  single  set  of  planes,  the  basic  checker 
is  invoked  on  these  planes  in  the  standard  fashion  (halo  expan¬ 
sion  is  performed  as  described  in  Section  4).  Errors  arising 
from  the  interaction  check  are  placed  in  the  parent  cell. 

Interaction checks are more expensive than basic checks,
since they involve flattening a piece of the hierarchy.
Fortunately, interaction checks can often be avoided. For example,
if an area contains no subcells, then there is no need to
perform an interaction check on that area. A simple check will
find all errors. The interaction check can also be avoided if
there is only a single subcell in an area, with no other subcells
or mask information nearby. In this case any errors must
come from within the subcell, and those errors will be found by
checks made within that cell. Interaction checks are necessary
only in areas where a subcell is within one halo distance of
mask information or another subcell. Even then, we only need
to check the area around the interaction.
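
One way to find where interaction checks are needed is to grow each subcell's
bounding box by the halo distance and intersect it with everything else in the
parent. The Python sketch below is an assumed formulation, not Magic's
algorithm: it works on bounding rectangles (x0, y0, x1, y1) only and reports
the overlap rectangle as "the area around the interaction."

```python
def bloat(rect, d):
    """Grow a rectangle by d on every side."""
    x0, y0, x1, y1 = rect
    return (x0 - d, y0 - d, x1 + d, y1 + d)


def intersect(a, b):
    """Return the overlap of two rectangles, or None if they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def interaction_areas(subcells, paint, halo):
    """Return areas where a subcell is within halo of paint or another subcell.

    subcells holds subcell bounding boxes; paint holds rectangles of mask
    information in the parent cell.
    """
    areas = []
    for i, box in enumerate(subcells):
        grown = bloat(box, halo)
        for other in subcells[i + 1:] + paint:
            overlap = intersect(grown, other)
            if overlap is not None:
                areas.append(overlap)
    return areas
```

An empty result means a simple check of the parent's own mask information is
sufficient for that cell.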

5.2. Checking Upward in the Hierarchy

When a cell is modified, simple checks and interaction
checks have to be performed within that cell, and also within
its parents in the hierarchy. For example, suppose mask
information has been edited within a cell. Then a simple check
must be performed within that cell, as well as an interaction
check if there are subcells near the modified area. However,
these two checks are not sufficient. If the modified cell is a
subcell of other higher-level cells, then the change may have
introduced interaction problems within the higher-level cells.
For each parent of the modified cell, an interaction check must
be performed over the area of the modification. Interaction
checks must also be performed in grandparents, and so on up
to the top-level cell in the hierarchy. In the cell that was
modified, both simple and interaction checks must be
performed, but in the parents and grandparents only interaction
checks are necessary.


Magic  uses  two  kinds  of  verify  tiles  to  handle  the  two 
kinds  of  checks.  When  a  cell  is  modified,  “verify-all"  tiles  are 
placed  in  that  cell  to  signify  that  both  simple  and  interaction 
checks  must  be  performed.  At  the  same  time,  “verify- 
interactions"  tiles  are  placed  in  parents  and  grandparents  to 
indicate  that  interaction  checks  have  to  be  performed.  The 
background  checker  keeps  track  of  which  cells  in  the  database 
contain  verify  tiles  and  performs  each  kind  of  check  wherever 
necessary. 
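
The placement of the two kinds of verify tiles can be sketched as a walk up
the cell hierarchy. This is an illustration under simplifying assumptions: the
hierarchy is a hypothetical dictionary from cell name to parent names, verify
tiles are recorded as (kind, area) pairs, and the coordinate transformation
between a subcell and each of its parents is ignored.

```python
def mark_modified(cell, area, parents, verify):
    """Record verify tiles after area of cell has been edited.

    parents maps a cell name to the names of cells that use it as a subcell;
    verify maps a cell name to its list of (kind, area) verify tiles.
    """
    verify.setdefault(cell, []).append(("verify-all", area))
    seen = set()
    stack = list(parents.get(cell, []))
    while stack:
        ancestor = stack.pop()
        if ancestor in seen:
            continue                 # a cell reachable by two paths is marked once
        seen.add(ancestor)
        verify.setdefault(ancestor, []).append(("verify-interactions", area))
        stack.extend(parents.get(ancestor, []))
    return verify
```

The background checker would then visit every cell with a non-empty list and
run the simple check, the interaction check, or both, as the tile kind
requires.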

In  the  worst  case,  the  hierarchical  algorithm  could  result 
in  the  modified  area  being  rechecked  once  at  each  level  of  the 
hierarchy  above  the  cell  that  was  changed,  with  a  separate 
flatten  operation  required  for  each  check.  However,  in  deep 
hierarchies  most  of  the  interaction  checks  are  avoidable:  in 
cells  far  above  the  modified  one,  the  modified  area  will  almost 
certainly  appear  in  the  middle  of  a  single  subcell  with  no  mask 
information  or  other  subcells  nearby.  Unless  there  are  many 
large  subcell  overlaps,  any  given  area  of  mask  information  is 
likely  to  require  an  interaction  check  at  only  one  point  in  the 
hierarchy. 

6.3.  Arrays 

One other form of hierarchical check arises because Magic
has an array construct. To simplify the creation of cell arrays,
Magic contains a special array facility: each subcell may
consist of either a single instance or a one- or two-dimensional
array of identical instances. Because of the array construct,
there is actually a third overall rule that guides the
hierarchical checker: each array must satisfy all the design
rules, independently of other information in the parent
containing the array. Whenever a change is made to an array, the
array structure is reverified by checking the three areas shown
in Figure 7.




Figure 7. The elements of this 3 by 3 array overlap in
both the horizontal and vertical directions. The array is
internally consistent if the three dotted areas satisfy the
design rules. All possible interactions between elements of
the array are identical to the ones that occur in these three
regions.

6. Implementation and Performance

The design-rule checker is written in C. Its 2800 lines of
code are divided into roughly equal fifths for the basic flat
checker, continuous checking, hierarchical checking, building
the internal rule table from the technology file, and the
command interpreter.

The  basic  checker  processes  2000  tiles  per  second  on  a 
VAX  11/780  running  Unix.  Measurements  of  the  number  of 
edges  found  and  rules  checked  per  tile  are  given  in  Table  2  for 
a typical flat cell and a worst-case cell hierarchy. Table 3
compares Magic's performance with that of other systems, based
on  transistors  per  second.  The  number  for  Magic  was  derived 
from  actual  designs  that  used  between  20  and  30  tiles  per 
transistor. 

A typical change to a circuit involves only a few tiles, so
the cost of incremental reverification is dominated by the size
of the halos. From this, we estimate that roughly 50 tiles have
to be checked per command in an nMOS design. This requires
about one-fortieth of a second of CPU time.
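
The figures quoted in this section can be cross-checked with a few
lines of arithmetic. The sketch below is only a reader's check in
Python; the function names are made up for illustration, and the
inputs are the numbers stated in the text (2000 tiles per second,
20 to 30 tiles per transistor, about 50 tiles per command).

```python
TILES_PER_SECOND = 2000  # basic checker speed reported in the text


def transistors_per_second(tiles_per_transistor):
    """Convert the tile rate into a transistor rate."""
    return TILES_PER_SECOND / tiles_per_transistor


def seconds_per_command(tiles_checked):
    """CPU time to recheck the tiles touched by one editing command."""
    return tiles_checked / TILES_PER_SECOND


fast = transistors_per_second(20)    # designs with 20 tiles per transistor
slow = transistors_per_second(30)    # designs with 30 tiles per transistor
per_command = seconds_per_command(50)
```

These values are consistent with the 60-100 transistors per second
reported for Magic in Table 3 and with the one-fortieth of a second
estimated per command.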


Cell          Tiles    Tiles/second   Edges/tile   Rules/edge
RISC floor    64585    2000           0.0          1.4
ALU latch     7485     500            1.0          1.6

Table 2. Performance measurements for Magic's design-rule
checker. The column for edges/tile shows how many edges
between two different materials were found per tile. (This
number can be less than one because adjacent tiles can have
the same type.) The last column shows the average number of
rules applied across each of these non-trivial edges. The RISC
floor plan contains the top level of routing for a
microprocessor, but no subcells. The number of tiles per second
is typical for flat checking. The ALU latch example consists of
a cell copied on top of itself. This gives a worst-case speed
for hierarchical checking, where the limiting factor is the
time to flatten the hierarchy.


System       Transistors/second
Lyra [2]     2
Baker [1]    3
Mart [5]     6-8
Magic        60-100

Table 3. Performance of several design rule checkers. All
of the programs were run on a VAX 11/780. Lyra uses
corner-based rules, sorts tiles into bins, and is written in
Lisp. Mart is similar to Lyra, but is written in C. The
Baker checker uses a raster-scan approach and is also written
in C. Unfortunately, we were not able to obtain
corresponding results for industrial design rule checkers.


7.  Conclusions 

Magic's design-rule checker demonstrates that incremental
checking is feasible. We think that circuit designers will
find that continuous feedback reduces the time needed to
create new designs or modify existing ones. The key to the
incremental checker is low overhead: the ability to run from
the same database as the interactive editor, the ability to find
important edges in the layout quickly, and the ability to find
nearby material quickly. The two features of Magic's database
that reduce overhead are the corner-stitched tile planes and
the abstract mask layers. Extending the checker to work in
hierarchical designs frees the designer from tedious
reverification of interactions when subcells are revised.

Abstract

The Magic layout editor provides a new operation called
plowing, for stretching and compacting Manhattan VLSI layouts.
Plowing works directly on the mask-level representation of a
layout, allowing portions of it to be rearranged while
preserving connectivity and layout-rule correctness. The layout
and connectivity rules are read from a file, so plowing is
technology independent. Plowing is fast enough to be used
interactively. This paper presents the plowing operation and the
algorithm used to implement it.


1.  Introduction 

Plowing is a new operation provided by the Magic layout
editor [OHMST 84] for stretching and compacting Manhattan
VLSI layouts. It allows designers to make topological changes
to a layout while maintaining connectivity and layout rule
correctness. Plowing can be used to rearrange the geometry of
a subcell, compact a sparse layout, or open up new space in a
dense layout. In a hierarchical environment plowing also
allows cell placement to be modified incrementally without the
need for rerouting. To avoid dependence on a particular
technology, plowing is parameterized by a set of layout and
connectivity rules contained in a technology file.

Conceptually the plowing operation is very simple. The
user places either a vertical or a horizontal line segment (the
plow) over some part of a mask-level representation of the
layout, and then gives the direction and the distance the plow
is to move. Plowing can be done up, down, to the left, or to the
right. (The rest of this paper will assume plowing to the
right.) The plow is then moved through the layout by the
distance specified. It catches vertical edges (boundaries
between materials) as it moves and carries them along with it.
Since only edges are moved, material behind the plow is
stretched and material in front of the plow is compressed.
Figure 1 shows how plowing can be used to open up new space.
Figure 2 shows how it can be used for stretching. Plowing can be
used to compact an entire cell by placing a plow to the left and
plowing right, then placing a plow at the top and plowing
down.

The work described here was supported in part by the Defense
Advanced Research Projects Agency (DoD) under Contract No.
N00034-K-0251.


(before)  (after)

Figure 1. Plowing opens up new space in a dense layout.
Geometry is pushed in front of the plow, subject to layout-
rule constraints. The connectivity of the original layout is
maintained. Jogs are inserted automatically where necessary.


(before)  (after)

Figure 2. Material to the left of the plow is stretched.
Material to the right is compressed. Objects such as
transistors do not change in size.

Plowing is so named because each of the edges caught by
the plow can cause edges in front of it to move in order to
maintain connectivity and layout-rule correctness. These edges
can cause still others to be moved out of the way, recursively,
until no further edges need be moved. A mound of edges thus
builds up in front of the plow in much the same manner as
snow builds up on the blade of a snowplow.

Section 2 of this paper discusses plowing in the context of
previous work. Sections 3 and 4 introduce the plowing
algorithm for a single mask layer. Section 5 extends it to
multiple mask layers and hierarchical designs. Finally, Section 8
presents performance measurements and our experience with
plowing in the Magic system.


2.  Background 

VLSI layouts are difficult to modify. Because of this,
designers are often committed to the initial choice of
implementation, rather than being able to experiment with
alternatives. Existing cells often cannot be re-used in
subsequent designs because they don't quite fit; it is typically
easier to redesign a new cell from scratch than to modify an old
one. Bugs in a dense layout are hard to fix, leading to a
debugging cycle that can take days or weeks.

Many of these difficulties stem from the fact that
seemingly small changes to a layout can have disproportionately
large effects. Sometimes this is for electrical reasons. For
example, in ratio logic such as nMOS, changes in the size of
one transistor may necessitate changes in the sizes of others.
However, even purely topological changes (those that preserve
the electrical properties of the layout) can require much more
work than the size of the change would suggest. As Figure 1
illustrated, merely opening up new space in a layout can cause
effects that ripple outward over a much larger area.
Rearranging the internal geometry of a cell or modifying the
placement of cells in a floor plan can be similarly expensive
because of the need to maintain connectivity with the
surrounding material.

Previous attempts to cope with the re-arrangement
problem have used symbolic design or sticks [RBDD 83, West 81,
Will 78]. In the symbolic/sticks approach, designers enter
layouts in an abstract form containing wires, contacts, and
transistors. The symbolic form is then run through a
compactor to generate actual mask information. As part of the
compaction, the circuit elements are moved as close together as
the layout rules permit. In a symbolic design style, cells can be
designed loosely without worrying about exact spacings, since
the spacings will be determined by the compactor. However, it
is not necessarily any easier to rearrange the topology of a
symbolic layout than it would be to rearrange the physical
layout.


The plowing approach has many of the advantages of
symbolic layout. It allows cells to be designed loosely and then
compacted. In addition, plowing can be used to rearrange cells
or open up new space, either across the whole cell or in one
small portion. Small changes can be made in one area without
having to recompact the entire cell, whereas a global
recompaction may potentially shift all geometry in the cell. The
plowing approach lets the designer see the final sizes and
locations of all objects as he is editing; in the symbolic
approach, it is harder to predict the final structure of a cell
from its abstract form, so compaction must be used frequently to
see the results of a change to the symbolic form.


3. Simple plowing algorithm

Plowing works by finding edges and moving them. An
edge is a boundary or a piece of a boundary, parallel to the
plow, between material of two different types. When an edge
moves, the material to its left is stretched, and the material to
its right is compressed. In this section we will describe how
plowing works when only a single mask layer is present. This
material will be assumed to have two layout rules: a minimum
width of w, and a minimum separation of s. Edges will always
be boundaries between this material and "empty" space.

The fundamental step in plowing is to move a single edge.
The plowing algorithm applies a series of rules to determine
which other edges must move as a consequence of this motion.
The following discussion presents plowing as though it moves a
given edge by first recursively sweeping all other edges out of
its way, and then sliding the edge into the newly opened space.
Section 4 will present a better scheme for ordering edge
motions than this depth-first recursion.

Figure 3. When the edge e moves, all edges in area A (the
area swept out by e) must be moved (a). Moving only
these edges results in edge f moving but not edge g. This
leaves a layout-rule violation (b) between e and g. Searching
area B as well as area A avoids this problem. The two
areas are referred to collectively as the umbra of edge e.


3.1. Finding edges

Figure 3 depicts a trivial layout consisting of three
unconnected pieces of diffusion. The edge labelled e is to be
moved to a final position indicated by the arrowhead. This could
be either because e was caught by the plow, or because it is
being moved to make room for some edge to its left. At a very
minimum, the rectangular area labelled A must be swept clear
of any material before the edge can be moved. However,
because of the spacing rule, any material inside area B would
then be too close to the newly moved edge. Consequently, the
area to be swept includes both areas A and B. The union of
these two areas is referred to as the umbra of the edge e*.

Figure 4. When the edge e moves (a), edges in its umbra
must be moved to the right. If only edges in the umbra are
moved, however, the result can be electrical disconnection
(b). To avoid this, plowing also moves edges in the penumbra
to the right, giving the correct result shown in (c). This
has the effect of inserting jogs automatically. The height of
the penumbra is w, the minimum width for diffusion. If
diffusion had been to the left of e instead of to the right,
the height of the penumbra would have been s, the minimum
separation.

Plowing must also search above and below the umbra to
prevent the edge from sliding too close to other edges above or
below it. Figure 4a shows why this is necessary. If material
were moved out of the umbra alone, as in Figure 4b, the result
is electrical disconnection. To avoid this, plowing must also
move edges out of the areas above and below the umbra. The
correct result is shown in Figure 4c. The areas above and
below the umbra are referred to collectively as the penumbra.
Jog insertion is an automatic consequence of searching the
penumbra. Moving edges out of the penumbra also prevents
electrical shorts, as can be seen by reversing the roles of
material and space in Figures 4a-4c.

The left-hand boundary of the penumbra is not always
aligned with the edge being moved. Instead, this boundary is
formed by following the outline of the material forming the
edge, as illustrated in Figure 5. This ensures that the
penumbra contains only those edges that must move in order to
preserve layout rule correctness and connectivity. The umbra
and penumbra of an edge are collectively referred to as its
shadow. The shadow of e contains all the edges that must move
as a direct consequence of moving e.

Figure 5. If e's penumbra included all of area A, as shown
in (a), then edge f would be found and moved, resulting in
(b). This is undesirable, since f need not move in order to
preserve layout-rule correctness and connectivity. A better
definition of the penumbra is area B only, as shown in (c).
Searching this area results in only the edge g being found
and moved, as is necessary to preserve layout rule
correctness.
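
As a rough illustration of the search areas defined in Section 3.1,
the sketch below computes the umbra and a simplified penumbra for
one edge moving to the right. It is a Python sketch under stated
assumptions, not Magic's code: Rect and shadow are hypothetical
names, the penumbra is taken as two plain rectangles spanning the
same x-range as the umbra (the text instead traces the outline of
the material to find the penumbra's left boundary), and the
extension past the final edge position is w when material lies to
the right of the edge and s when it lies to the left.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x_lo: int
    y_lo: int
    x_hi: int
    y_hi: int


def shadow(x, y_lo, y_hi, distance, material_on_right, w, s):
    """Return (umbra, upper penumbra, lower penumbra) for an edge moving right.

    The edge runs from y_lo to y_hi at coordinate x and moves by `distance`.
    """
    rule = w if material_on_right else s   # width rule or separation rule
    x_far = x + distance + rule            # swept area plus the rule distance
    umbra = Rect(x, y_lo, x_far, y_hi)
    upper = Rect(x, y_hi, x_far, y_hi + rule)   # simplified: plain rectangle
    lower = Rect(x, y_lo - rule, x_far, y_lo)   # simplified: plain rectangle
    return umbra, upper, lower


# Edge at x=10 spanning y=0..5, material on its left, moving 4 units right.
umbra, upper, lower = shadow(10, 0, 5, 4, material_on_right=False, w=2, s=3)
```

Any edge found inside one of these three rectangles is a candidate
for being pushed to the right; the refinement shown in Figure 5
would shrink the penumbra rectangles further.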

3.3.  Sliver  prevention 

The rules described in Section 3.1 guarantee that plowing
never moves one vertical edge too close to another. However,
they do allow violations to be introduced between horizontal
segments that are formed when material is stretched. These
violations take the form of slivers of material or space whose
height is less than the minimum allowed. Eliminating such
slivers requires that their left-hand edges be moved, as
illustrated in Figure 6. The left-hand edge of each sliver lies
along the left-hand boundary of the penumbra, so it can be found
when tracing the outline of the penumbra.

* In a solar eclipse, the umbra is that portion of the moon's
shadow from which the sun appears to be completely eclipsed. The
penumbra is the partial shadow surrounding the umbra. In plowing,
the umbra of an edge contains edges directly in its path, while the
penumbra contains edges to either side of its path but nonetheless
too close.


Figure 6. When the edge e moves (a), a sliver of space is
introduced below the horizontal segment A, as shown in (b).
To correct this, the left-hand edge of this sliver, f, is moved
along with e, but only as far as the right-hand end of the
segment A (c).


4.  Breadth-first  vs.  Depth-first  Search 

In the previous section, plowing was described as a
depth-first search in which all edges to the right of a given
edge were moved before the edge itself. While this approach is
conceptually clear, it has poor worst-case behavior. An N-tier
lattice structure as illustrated in Figure 7 requires on the order
of 2^N edge motions, because plowing performs the recursive
search to the right of an edge each time the edge is moved. If,
as in the example, each edge must be moved once for each of
its two neighbors to the left, the edges at the right-hand side of
the lattice are moved a number of times that is exponential in
the number of tiers.

Figure 7. This lattice structure causes exponential worst-
case behavior in the depth-first plowing algorithm when
edges in the shadow are processed from top to bottom.
The objects (A, B, etc.) must be incompressible to cause
this worst-case behavior. Object B is moved once when
object A moves, then slightly farther when object C moves.
The numbers to the left of each object show how many
times each of its edges is moved.


Lattice structures such as this one are fairly common in
real layouts; a routing channel containing jogs is one example.
The real plowing algorithm must avoid paying the exponential
cost of plowing such a structure. It does so by waiting until
the final position of an edge is known before it performs the
search to the right of that edge. This strategy causes the
number of edge motions to be linear in the number of edges in
the lattice. (See [Oust 84] for a detailed explanation.)

A simple way to ensure that edges are moved only once
their final positions are known is to use breadth-first search.
Magic maintains a list of edges to be moved, sorted in order of
increasing x-coordinate. On each iteration, the leftmost edge is
removed from the list and the shadow to its right is searched.
Any edges discovered by this search are placed in the list along
with the amount they must move. Since the final position of
an edge can only be affected by edges to its left, the final
position of the leftmost edge in the list is always known.
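
The difference between the two orderings can be seen on a toy model.
The sketch below is an illustration in Python, not Magic's
implementation: the lattice built by build_lattice, the "slack"
values, and all function names are assumptions chosen so that each
route to an edge demands a slightly larger motion than the previous
one, as in Figure 7. A shadow search is reduced to a lookup in a
table of which edges push which others.

```python
import heapq


def build_lattice(n):
    """Build an n-tier ladder. pushes[u] lists (v, slack): if u moves by d,
    v must move by at least d - slack. xpos gives each edge's x-coordinate."""
    pushes, xpos = {}, {}
    for i in range(n):
        a, b, nxt = f"a{i}", f"b{i}", f"a{i + 1}"
        pushes[a] = [(nxt, 2 ** (n - 1 - i)), (b, 0)]  # smaller demand first
        pushes[b] = [(nxt, 0)]
        xpos[a], xpos[b] = 2 * i, 2 * i + 1
    pushes[f"a{n}"] = []
    xpos[f"a{n}"] = 2 * n
    return pushes, xpos


def plow_depth_first(pushes, start, dist):
    """Re-search to the right every time an edge's distance grows."""
    moved, motions = {}, {}

    def move(edge, d):
        if d <= moved.get(edge, 0):
            return
        moved[edge] = d
        motions[edge] = motions.get(edge, 0) + 1
        for v, slack in pushes[edge]:
            move(v, d - slack)

    move(start, dist)
    return moved, motions


def plow_breadth_first(pushes, xpos, start, dist):
    """Process edges in increasing x, so each distance is final when used."""
    pending = {start: dist}
    heap = [(xpos[start], start)]
    moved, motions = {}, {}
    while heap:
        _, edge = heapq.heappop(heap)
        d = pending[edge]              # final: nothing to the left remains
        moved[edge], motions[edge] = d, 1
        for v, slack in pushes[edge]:
            need = d - slack
            if need <= 0:
                continue
            if v not in pending:
                pending[v] = need
                heapq.heappush(heap, (xpos[v], v))
            else:
                pending[v] = max(pending[v], need)
    return moved, motions


pushes, xpos = build_lattice(10)
dfs_moved, dfs_motions = plow_depth_first(pushes, "a0", 2 ** 10)
bfs_moved, bfs_motions = plow_breadth_first(pushes, xpos, "a0", 2 ** 10)
```

On this ten-tier example both orderings reach the same final
distances, but the depth-first version moves the rightmost edge 1024
times while the breadth-first version moves every edge exactly once.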

The original depth-first algorithm allowed the layout to
be modified incrementally as plowing progressed, since an edge
was never moved until the area into which it was moving had
been cleared. Incremental modification is impossible with
breadth-first search, since edges to the right will not be moved
as long as there are queued edges to the left of them waiting to
be moved. Instead of actually updating the layout as it
progresses, the breadth-first version of plowing stores with
each vertical edge segment the distance it moves. When the
shadows of all edges have been searched, and the distance each
edge moves has been determined, plowing invokes a post-pass
to update the layout from the information stored with each
edge.


Figure 8. When processing an edge in the breadth-first
approach, it is important to use information about the final
positions of edges that have already been processed. In (a),
it has already been decided to move edge f, but the edge
will not actually be moved until all other edges have been
processed. If edge e is processed without considering the
new position of f, a sliver will result as shown in (b).
Instead, the plowing algorithm must consider the eventual
positions of edges that have already been processed, to
produce the result of (c).

However, if the layout is not modified until all edges have
been processed, special care must be taken to avoid the
generation of slivers. Figure 8 illustrates the problem. To
process each edge correctly, it is important to know what other
edges have already been processed and what their final positions
will be. In general, the plowing algorithm must consider edges
whose final positions will be in the shadow, rather than those
whose initial positions are in the shadow.

The success of the breadth-first algorithm depends on the
fact that left-to-right plowing never changes the order of edges
along any horizontal line, and never changes any vertical
coordinates. Furthermore, each edge has stored with it the
distance it is going to move. As a consequence, plowing can use
the initial layout structure for searching, and yet can easily
find all objects whose final coordinates fall in a given area.
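
A small sketch of this idea, assuming each edge record carries its
initial position and its stored motion distance. The Edge class and
edges_finally_in are hypothetical names, and the linear scan stands
in for the search Magic performs on its own layout structure; only
the use of "initial x plus stored distance" is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    name: str
    x: int             # initial x-coordinate; the layout is not yet updated
    y_lo: int
    y_hi: int
    distance: int = 0  # motion already decided for this edge, 0 if none


def edges_finally_in(edges, x_lo, x_hi, y_lo, y_hi):
    """Return names of edges whose final position falls in the given area."""
    found = []
    for edge in edges:
        final_x = edge.x + edge.distance
        overlaps_y = edge.y_lo < y_hi and y_lo < edge.y_hi
        if x_lo <= final_x < x_hi and overlaps_y:
            found.append(edge.name)
    return found


edges = [
    Edge("f", 10, 0, 4, distance=5),  # starts outside the area, ends inside
    Edge("g", 13, 0, 4),              # not moving, already inside
    Edge("h", 11, 0, 4),              # not moving, outside
]
in_shadow = edges_finally_in(edges, 12, 20, 0, 4)
```

Here edge f is reported even though its initial position lies
outside the searched area, which is the behavior Figure 8 calls for.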

5. Extensions for real layouts

This section extends the simple plowing algorithm of the
previous two sections to handle multiple mask layers and
fixed-size objects such as transistors. It presents a way in
which the number of jogs introduced by plowing can be
controlled. Finally it describes how hierarchical layouts can be
plowed.


5.1.  Multiple  mask  layers 

The simple version of plowing assumed that the shadow
extended to the right of the final position of a moving edge by
either w (the minimum width rule) if material lay to the right
of the edge, or s (the minimum separation rule) if material lay
to the left of the edge. This ensured that the shadow included
all edges directly in the path of the edge being moved. Since
the same layout rule applied between the edge being moved
and any other edge, all edges found during the search of the
shadow would have to move.

With more than one mask layer there may be more than
one layout rule to apply for a given edge. For example, in our
nMOS process, the minimum separation between diffusion and
polysilicon is 2 microns, while that between two pieces of
diffusion is 6 microns. Both of these rules apply at an edge
between diffusion and empty space.

In Section 3.1, the umbra consisted of the area swept out
by an edge being moved, plus an additional area to the right of
the final position of the edge. Because several rules may now
apply when moving a given edge, the width of this additional
area must be the longest distance of any layout rule associated
with the edge being moved.


Figure 9. The area of a shadow search is determined by
the worst-case layout rule. However, not all edges in that
area will have to be moved. Edge f must move, because
the separation between two polysilicon features must be 4
microns and edge e approaches to within 2 microns of f.
Edge g need not move since the minimum separation
between polysilicon and diffusion is only 2 microns.

However, as Figure 9 illustrates, not all of the edges
found while searching the umbra must actually move. Each
edge found must be checked for its minimum allowable
separation from the edge being moved. The same techniques used
in Magic's layout rule checker [TaOu 84] may be used to perform
this check very quickly.
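
The two-step scheme of Section 5.1 (search as far as the worst-case
rule, then test each edge found against its own rule) might be
sketched as below. This is a Python illustration only: the SPACING
table, its layer names, and its distances are illustrative values
assumed for the example, not the contents of a real technology
file, and search_width and must_move are hypothetical names.

```python
# Illustrative minimum separations in microns, keyed by the pair of
# materials involved (assumed values, for the example only).
SPACING = {
    frozenset(["poly"]): 4,          # polysilicon to polysilicon
    frozenset(["poly", "diff"]): 2,  # polysilicon to diffusion
    frozenset(["diff"]): 6,          # diffusion to diffusion
}


def search_width(material):
    """Longest rule distance that applies to an edge of this material."""
    return max(d for pair, d in SPACING.items() if material in pair)


def must_move(moving, other, gap_after_move):
    """True if the edge found would end up closer than its own rule allows."""
    return gap_after_move < SPACING[frozenset([moving, other])]


width = search_width("poly")
poly_neighbor_moves = must_move("poly", "poly", 2)
diff_neighbor_moves = must_move("poly", "diff", 2)
```

With these values a polysilicon edge searches 4 microns ahead; a
polysilicon neighbor left 2 microns away must move, while a
diffusion neighbor at the same distance may stay where it is.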

Multiple mask layers require that plowing take extra care
to maintain connectivity with material above and below an
edge being moved. In the single-layer scheme, the penumbra
search guarantees that the material does not become
disconnected. However, the penumbra search follows the outline
of a single type of material, so it will not by itself guarantee
that two adjacent materials of different types will remain
connected (see Figure 10).

Special actions must be taken during the penumbra
search to handle horizontal edges between different materials.
First, if two materials share a horizontal edge, then Magic
guarantees that one material does not slide past the end of the
other: it maintains a minimum-width connection between the
two (this is the case between materials A and B in Figure 10).
Second, if one material completely covers the edge with
another material (for example, the A-C edge in Figure 10),
Magic plows the other material as much as is needed to
maintain complete coverage. This ensures, for example, that
transistors are not uncovered by plowing polysilicon off one
side.

Figure 10. If edge e is plowed, material A may disconnect
from B and C. To prevent this, a minimum-width segment
of edges f and g is dragged along with e. The edge g is
moved not to maintain connectivity (which would have
been achieved by moving A), but to prevent C from being
uncovered. In (c), m1 is the lesser of the minimum widths
for A and B, m2 is the minimum width for B, and m3 is
the minimum width for C.

Figure 11. When inelastic objects are present, plowing
may have to cope with circular dependencies. Material B is
inelastic, and A and C are both minimum-width. When
edge e moves by distance d in (a), object B must move by
the same distance to prevent A from being uncovered. To
prevent C from being uncovered, C's left-hand edge must
move, finally causing edge f to move by distance d. Edge e
is in f's shadow as a result, but should not be moved a
second time.


5.2. Inelastic features

Certain features in a layout should not be stretched or
compacted. Transistors, for example, have sizes chosen for
electrical reasons, as do contacts. Our discussion of edge
motion has assumed that the material forming both sides of
the edge was stretchable. When material is inelastic, both its
left-hand and right-hand edges must be moved in tandem.

In particular, if the right-hand edge of a piece of inelastic
material moves, its left-hand edge must also move. Figure 11
illustrates how this can lead to a cycle of dependencies. The
plowing algorithm breaks this cycle by comparing the amount
an edge is supposed to move with the motion distance already
stored with the edge. If the stored motion distance is greater,
the edge need not be moved a second time.

In cases where a layout rule violation exists in the original
layout, an infinite loop is still possible. In Figure 11, for
example, the distance r between edges f and e is less than s, the
minimum separation allowed. Edge e initially moves by
distance d. Plowing should move all edges found in the shadow of
f far enough away so as not to cause any rule violations with
the newly moved f. Hence edge e would have to move by
d+s-r, which is more than the motion distance stored with the
edge. This leads to an infinite loop in which edge e is moved
by an additional s-r on each pass around the cycle.

Plowing avoids this sort of infinite loop by never moving
a shadowed edge (e) more than the edge causing the shadow
(f). This technique prevents infinite looping in
over-constrained situations, but preserves existing layout rule
violations.

Figure 12. A contact is duplicated on each plane it
connects. When an edge of a contact is moved on one plane, it
is moved on all other planes as well.
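
The two guards described in Section 5.2 (skip an edge whose stored
distance already covers the request, and never move a shadowed edge
farther than the edge causing the shadow) can be combined in one
small check. The Python sketch below is an illustration, not Magic's
code: request_move and the stored-distance dictionary are
hypothetical, the numbers d, s, and r are made up, and treating an
equal stored distance as "already moved" is an assumption about how
the comparison is applied.

```python
def request_move(stored, edge, wanted, cause_distance):
    """Record a motion for a shadowed edge.

    Returns True if the edge has to be (re)processed, False if the
    motion distance already stored with it is sufficient.
    """
    distance = min(wanted, cause_distance)  # never more than the causing edge
    if distance <= stored.get(edge, 0):     # stored motion already covers it
        return False
    stored[edge] = distance
    return True


# Scenario of Figure 11 with a pre-existing violation (r < s).
d, s, r = 10, 3, 1
stored = {}
first_e = request_move(stored, "e", d, d)           # e is caught and moves by d
first_f = request_move(stored, "f", d, d)           # the cycle brings f along by d
second_e = request_move(stored, "e", d + s - r, d)  # f's shadow now contains e
```

Without the clamp, the third request would ask for d+s-r and be
accepted, restarting the cycle; with it, the request is reduced to
d, which is already stored, so the loop ends and the original
violation is left in place.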


5.3. Noninteracting planes

Section 4 explained that the order of vertical edges along
a horizontal line is unchanged by plowing. Thus material
being plowed can never slide over other material in its path.
There are cases, however, where it is desirable that certain
materials in a layout move independently. Metal, for example,
does not interact with either polysilicon or diffusion except at
contacts, so it should be able to slide over them.

To allow sliding, Magic segregates the mask information
in a layout into a collection of non-interacting planes.
Material in one plane is free to slide past material in any other
plane. The nMOS technology, for example, has two planes:
one to hold metal wires, and one to hold polysilicon, diffusion,
and transistors.

The plowing algorithm operates on each plane independently.
The only interaction between planes occurs at contacts,
which are duplicated in each plane that they connect.
When an edge of a contact is moved in one plane, the
corresponding edge of the contact in all other planes is moved
by the same amount, as illustrated in Figure 12. This also
moves whatever the contact connects to in the other planes,
thus preserving connectivity.

6.4.  Jog  control 

Section 3.1 described how jog insertion was an automatic
consequence of the rules plowing uses for finding edges to
move. Plowing creates a jog whenever it moves only part of
the boundary between two different types of material.
Unfortunately, this often introduces a large number of jogs, which is
bad both because it increases the size of the database needed
to represent the layout, and because it may reduce fabricability.
To control the number of jogs inserted by plowing, the
user can specify a "jog horizon". Whenever an edge is about
to be moved, plowing will attempt to extend it up and down to
the nearest existing jog in each direction. If an existing jog is
found within the jog horizon of the corresponding endpoint of
the edge, the existing jog is used; otherwise, the endpoint of
the edge is used to form a new jog.
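
The jog-horizon rule can be illustrated with a small sketch. It assumes an edge is described only by the y-coordinates of its two endpoints and that existing jogs along the same boundary are given as a list of y-coordinates; the function names are hypothetical.

```python
def extend_endpoint(endpoint, jogs, horizon, direction):
    """Return the y-coordinate at which the moved edge should end.

    direction is +1 when searching upward and -1 when searching downward.
    """
    candidates = [
        j for j in jogs
        if (j - endpoint) * direction >= 0 and abs(j - endpoint) <= horizon
    ]
    if candidates:
        # Reuse the nearest existing jog inside the horizon.
        return min(candidates, key=lambda j: abs(j - endpoint))
    return endpoint  # no jog close enough: a new jog is formed here


def extend_edge(bottom, top, jogs, horizon):
    """Extend both endpoints of a vertical edge according to the jog horizon."""
    return (
        extend_endpoint(bottom, jogs, horizon, -1),
        extend_endpoint(top, jogs, horizon, +1),
    )
```

A larger horizon trades extra moved material for fewer jogs; a horizon of zero reproduces the default behavior of creating a jog at each endpoint.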

6.5. Subcells and hierarchy

One approach for plowing a hierarchical layout, such as
that shown in Figure 13a, is to treat it as though it were
non-hierarchical and propagate edge motions inside subcells. This
might be workable when no subcell is used more than once.
However, Magic instantiates subcells by reference, so a change
in one instance of a subcell is reflected in all its other instances.
Situations in which a subcell is used more than once can
produce unsatisfiable sets of constraints, as Figure 13b illustrates.

Magic takes a simpler approach, which is to view subcells
as black boxes to which connectivity must be maintained by
plowing, but whose internal structure should not be modified.
A benefit of Magic's approach is that plowing can be used to
modify the placement of cells in the floor plan of a chip, since
it only changes the location of subcells, not their contents.

When any mask geometry that abuts or overlaps a cell is
moved, the entire cell must move by the same amount.
Conversely, whenever a subcell moves, all mask geometry and
other subcells that abut or overlap it must also move by the
same amount. The net effect is that a cell behaves like flypaper,
causing all geometry over its area to "stick" to it and
move as a whole when any part of it is required to move.

Figure 13. Plowing in the presence of hierarchy. (a)
Plowing might treat hierarchy as though it were invisible to
the user. Each of cells A and B would be modified. (b)
Cell C is used twice, once flipped left-to-right and once in
its normal orientation. Both uses refer to the same master
definition of C. Moving edge e to the right is impossible,
because it requires e to move to the left in order to keep
out of its own path. The more edge e is moved to the right
in the left-hand use, the worse the violation becomes.

In addition to preserving connectivity with subcells, when
plowing moves other geometry it must avoid introducing any
layout rule violations with the geometry inside a subcell. One
approach for dealing with this is to define a protection frame
[Kell 82] for each cell, an outline around the cell into which no
material may be plowed. Magic uses an extremely simple form
of protection frame: it assumes that the cell contains all types
of material right up to the border of its bounding box.

For example, in our nMOS rule set, the worst-case layout
rule involving diffusion is the diffusion-diffusion spacing rule of
6 microns. An edge with diffusion to its left can be plowed to
within 6 microns of a subcell before that subcell will itself have
to move. The worst-case rule distance involving polysilicon is
8 microns, so polysilicon can only be plowed to within 8
microns of a subcell before the cell must move. Since the
contents of subcells are considered unknown, the closest one
subcell can be plowed to another before the other will have to
move is the worst-case layout rule in the entire ruleset, which
in our ruleset is 8 microns. Of course, if the user wishes to
overlap two cells, he can still do that using editing operations
other than plowing.
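
The protection-frame distances quoted in this example can be expressed as a simple lookup. The sketch below is illustrative only: the `WORST_CASE` table holds just the two values given in the text (6 microns for diffusion, 8 microns for polysilicon), and the function names are hypothetical.

```python
# Worst-case rule distances in microns, as quoted in the text for the nMOS rule set.
WORST_CASE = {"diffusion": 6, "polysilicon": 8}


def halo(material=None):
    """Closest approach to a subcell before that subcell must move.

    material=None stands for another subcell, whose contents are unknown,
    so the worst-case rule of the whole rule set applies.
    """
    if material is None:
        return max(WORST_CASE.values())
    return WORST_CASE[material]


def cell_must_move(edge_x, cell_left, material=None):
    """True if an edge plowed to edge_x forces the cell at cell_left to move."""
    return cell_left - edge_x < halo(material)
```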


7. Results and experience

Plowing has been implemented as part of the Magic VLSI
layout system, which runs under the Berkeley 4.2 Unix operating
system on either VAXes or Sun workstations. About 7500
lines of C code were required to implement all of the features
described in this paper. The current version supports plowing
only from left to right, but is being expanded to
operate in all four directions. Table I gives measurements of
the performance of the left-to-right version, using several
examples taken from designs at Berkeley.

We have had no real user experience with the system yet,
since it is just now becoming operational. However, the initial
reaction from designers has been very positive. One recently
discovered problem has to do with line widths: the design
rules only specify minimum widths for lines, so the current
version of plowing will reduce the widths of lines that were
initially wider than minimum width. This is unacceptable for
many signals and especially for power and ground, so the
plowing implementation is being modified to avoid reducing the
widths of lines.

Although plowing is a rather tricky operation to implement
correctly, it is very simple from the user's standpoint,
and runs quickly enough to provide interactive response even
for large cells. We hope that it will simplify the task of
rearranging circuits topologically. If so, it will make it easier for
designers to optimize their layouts, and will also help them to
develop intuitions by allowing them to try out many alternative
designs easily.


Example       tiles       cells    edges    time

48-bit bus    101/480     0        382      2.5
ALU latch     430/472     0        785      3.0
Bus driver    848/1154    13       909      S.S

Table  L  Plowing  performance.  The  tiles  column  records 
the  aomber  of  tiles  of  mask  information  in  the  cell  being 
plowed,  before/after  plowing.  The  ctili  column  records  the 
number  of  subcells  in  the  cell  being  plowed,  and  edges 
records  the  number  of  edges  processed  during  plowing.  The 
time  is  in  seconds  on  a  VAX-11/780. 
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Alatract  We  address  the  optima]  logic  design  of  PLA-beeed  Finite  State 
llachines  (FSlf)  Techniques  related  to  heuristic  combinational  logic 
minimization  are  used  to  determine  optimal  coding  of  the  FSU  internal 
states  We  show  that  if  appropriate  Hamming -distance  requirements 
among  state  codes  are  preserved,  reduction  of  the  combinational  logic 
is  guaranteed.  A  state  encoding  technique  satisfying  these  require¬ 
ments  and  based  on  graph  embedding  in  squashed  hypercubes  is 
presented  Experimental  results  are  reported 

1. INTRODUCTION

Sequential circuits play a major role in the control part of digital
systems. We address the automated synthesis of sequential logic
functions in a structured VLSI design methodology. We consider sequential
logic functions implemented by synchronous deterministic Finite State
Machines (FSM) consisting of two distinct components: a combinational
circuit implemented by a Programmable Logic Array (PLA) and a
memory implemented by Delay-type registers.

In particular we consider here the problem of assigning binary
codes to the internal states of a Finite State Machine. The literature is
rich in papers dealing with the state-assignment problem. Here we
refer to the major approaches only. Armstrong [1] introduced a set of
criteria for encoding states, aiming at the minimization of the number
of gates used to implement the FSM, and formulated the encoding
problem as a graph embedding problem. Hartmanis [2], Stearns [3] and
Karp [4] developed algebraic methods based on partition theory and on
a reduced dependence criterion. Dolotta and McCluskey [5] suggested
a "column-based" procedure to code states.

Note that despite these efforts, to the best of our knowledge no tool for
designing FSMs is in use today for a time-effective state encoding of
industrial digital controllers.

Armstrong's approach can in principle handle rather large
machines, but it has three serious drawbacks. The first is related to the
fact that the criteria suggested by Armstrong do not take into account
the techniques of fast heuristic logic minimizers such as MINI [6],
PRESTO [7], or ESPRESSO-II [8] in use today (Armstrong's paper
appeared before the work on heuristic minimizers started). The
second is that the state-assignment problem is transformed into a
particular graph-embedding problem, which represents only partially the
state coding problem, as shown in section 4. The third is that the
graph embedding algorithm suggested by Armstrong was ineffective.

Our approach is based, as Armstrong's, on the use of distance
relations among the codes of the internal states. In section 3 we show
how the combinational logic can be reduced by requiring state codes to
satisfy appropriate distances. Distance requirements are determined
by predicting the effects of heuristic minimization of the combinational
logic related to a symbolic description of the FSM, and are represented
by a graph. In particular it is shown that a convenient reduction of the
combinational logic is obtained if the distance between some state
codes is large enough and appropriate states have adjacent codes.

In section 4 we consider the problem of assigning codes which
satisfy the distance relations. Adjacent code assignment can be seen
as an embedding of an adjacency graph into a boolean hypercube.
Armstrong [1] and Saucier [9] represented the state assignment
problem as a subgraph isomorphism problem, where a one-to-one relation
(coding) is sought between the set of the states (vertices of the
adjacency graph) and a subset of the boolean hypercube vertices (codes).

Note that even questioning the existence of a subgraph isomorphism is
a hard problem: in particular it was shown to belong to the class of
NP-complete problems [10]. Since such an isomorphism may not exist,
Armstrong and Saucier relaxed some adjacency requirements and
proposed heuristic techniques to embed a subgraph of the adjacency
graph into the boolean hypercube. Note that a distance-preserving
embedding is not even guaranteed by augmenting the dimensions of
the hypercube, i.e. increasing the length of the state codes.

Our approach exploits the use of don't care conditions in state
codes. In particular every state is coded by associating each vertex of
the adjacency graph to a subcube of the boolean hypercube. This is
equivalent to embedding the adjacency graph into a squashed hypercube,
i.e. a hypercube having appropriate faces squeezed into vertices [11].
Note that most of the state assignment techniques presented in the
literature obtained a state coding using the minimum number of bits,
because it was important to minimize the number of memory elements
due to their cost. On the other hand, the area taken by the PLA is the
major concern in a VLSI circuit implementation of a Finite State
Machine. Minimal area PLA implementations of the FSM combinational
component can be obtained by using non-minimal-length state codings,
i.e. fewer product-terms are often required to implement a logic
function at the expense of an increased number of input/output columns.
Therefore we allow non-minimal-length state codings when leading to
minimal area PLAs. In this case, state coding corresponds to an
embedding into a squashed hypercube of variable dimension. However
bounds on code length can be enforced when required by a particular
implementation.

2. FINITE STATE MACHINE REPRESENTATION
Different functional FSM representations are commonly used.
Most state-assignment techniques reported in the literature are based
on a state-table representation, though it can be cumbersome for large
incompletely specified machines. For this reason designers describe
the machine functionality by means of flow-charts or Hardware
Description Languages (HDL). Unfortunately these descriptions are not
well-suited to support machine optimization techniques. For these
reasons we represent the FSM functionality by means of a symbolic cover.
The concept of symbolic cover is a generalization of the logic cover
representation of combinational-logic functions [6]. Symbolic covers
can be obtained from flow-charts, HDL or state tables in a
straightforward way.

A symbolic cover is a set of primitive elements called symbolic
implicants. A symbolic implicant (denoted here by a capital letter, e.g.
$A = [i_A, s_A, s'_A, o_A]$) is a set of two input and two output character
strings. The two input strings represent a binary-valued representation
of a primary input ($i_A$) and a symbolic representation of a present
state ($s_A$). The two output strings represent the corresponding
symbolic representation of the next-state ($s'_A$) and a binary-valued
representation of the primary outputs ($o_A$). Note that we consider in
this paper the problem of assigning binary codes to the FSM internal
states only. Therefore we assume that $i_A$ and $o_A$ are already coded into
binary strings. However $i_A$ and $o_A$ might describe symbolic inputs and
outputs in a more general framework, where primary input and output
coding is also considered. We represent binary-valued variables by the
symbols "1", "0" and "*", where "*" represents a don't care condition.
States are represented symbolically by a character mnemonic string.
Example. Consider the traffic-light controller presented in
[12]. The following is a symbolic implicant:

11*, HG, HY, 10010

showing that a "1" in the first two primary-input lines maps
state "HG" into state "HY" and asserts output 10010. The
symbolic cover is the collection of the symbolic implicants
representing the state transitions:

0**, HG, HG, 00010
*0*, HG, HG, 00010
11*, HG, HY, 10010
**0, HY, HY, 00110
**1, HY, FG, 10110
10*, FG, FG, 01000
0**, FG, FY, 11000
*1*, FG, FY, 11000
**0, FY, FY, 01001
**1, FY, HG, 11001
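
As an illustration, a symbolic cover of this kind can be held in a small data structure. The sketch below assumes the ten implicants of the traffic-light example as transcribed from the listing above (with "*" as the don't care symbol); the `Implicant` type and `parse_cover` helper are hypothetical names introduced here.

```python
from collections import Counter
from typing import NamedTuple


class Implicant(NamedTuple):
    inputs: str      # binary-valued primary inputs, "*" = don't care
    present: str     # symbolic present state
    next_state: str  # symbolic next state
    outputs: str     # binary-valued primary outputs


COVER_TEXT = """
0** HG HG 00010
*0* HG HG 00010
11* HG HY 10010
**0 HY HY 00110
**1 HY FG 10110
10* FG FG 01000
0** FG FY 11000
*1* FG FY 11000
**0 FY FY 01001
**1 FY HG 11001
"""


def parse_cover(text):
    """Turn one implicant per line into a list of Implicant tuples."""
    return [Implicant(*line.split()) for line in text.strip().splitlines()]


cover = parse_cover(COVER_TEXT)
states = {imp.present for imp in cover} | {imp.next_state for imp in cover}
# How often each state appears as a next-state in the cover.
next_counts = Counter(imp.next_state for imp in cover)
```

The state symbols stay symbolic here; assigning binary codes to them is exactly the state assignment problem discussed in the rest of the paper.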

Note that a symbolic cover is a logic cover of a multiple-valued logic
function [4], [13], where each state takes a different logic level and is
represented by a character string. A symbolic implicant having n (m)
primary input (output) bits can be seen as an (n+1)-input, (m+1)-output
multiple-valued logic implicant. Definitions and properties of
multiple-valued-logic covers carry over to symbolic covers as well [13].

The motivation for using a symbolic cover relies on the following
points.

I) properties and operations on symbolic covers can be exploited
and related to heuristic minimization algorithms for binary-valued
logic functions [6], [7], [8].

II) any logic cover of the combinational component of a FSM
obtained by assigning disjoint codes to each state can be seen as a
symbolic cover. Hence the technique we present can be interfaced
to several FSM automated design tools, in order to implement the
machine aiming specifically at a PLA-based implementation in a
minimal area.


The state assignment problem consists of determining a coding map
c( ) which transforms the state symbols into strings of binary digits.
This is equivalent to transforming the symbolic cover into a
binary-valued logic cover of the combinational component. Note that in
general don't care coordinates are used in state codes, and therefore
every state is assigned to a subcube of the boolean hypercube.
However a coding map is implementable only if the states are assigned to
non-overlapping regions of the boolean hypercube.

State coding affects substantially the complexity of the
combinational component of a FSM, because minimal binary-valued logic covers
[6] corresponding to different coding maps have different cardinalities.
We consider first coding maps with no code-length bounds. The
unconstrained optimum state assignment problem can be stated as follows:
Find an implementable coding map c( ) that minimizes the
cardinality of the minimal logic cover of the FSM combinational
component.

This is a formidable task, because it involves the search for all the
minimal covers related to all possible codings! We therefore
concentrate on a simpler problem and we relate optimal state coding to
heuristic minimization of the logic cover [6], [8]. In particular we look
for an implementable coding map which leads to a minimal logic cover
having significantly fewer implicants than the original symbolic cover.
Similarly a constrained state assignment problem can be defined by
restricting the search to codings of bounded length.

3. CODE DISTANCES AND COMBINATIONAL LOGIC MINIMIZATION

We investigate in this section the relations between state
assignment and the complexity of the related implementation of the
combinational part of a FSM. In particular a set of rules can be obtained to
determine constraints on state code distances, so that either the
cardinality or the number of literals of the logic cover (or both) can be
reduced. However we report here on the two major rules only.

We call cube any string of characters from the set {0, 1, *}. We
refer the reader to [6] for definitions of cover ($\supseteq$), union ($\cup$), sharp ($-$)
and intersection ($\cap$) between cubes. The distance $D(a, b)$ between
two cubes a and b of equal length is the number of positions in which
they differ. The Hamming distance $H(a, b)$ between two cubes a
and b of equal length is the number of positions in which they differ
and both entries are cares. Note that if $s_1$ and $s_2$ are two different
state symbols, an implementable coding is such that
$H(c(s_1), c(s_2)) > 0$. We define two state codes to be adjacent if
their Hamming distance is one, because they are adjacent vertices of a
squashed cube representation.
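
The two distance measures can be made concrete with a short sketch. It assumes cubes are Python strings over the characters "0", "1" and "*"; the function names are illustrative.

```python
def distance(a, b):
    """D(a, b): number of positions in which two equal-length cubes differ."""
    if len(a) != len(b):
        raise ValueError("cubes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming(a, b):
    """H(a, b): positions that differ where both entries are cares (0 or 1)."""
    if len(a) != len(b):
        raise ValueError("cubes must have equal length")
    return sum(x != y and x != "*" and y != "*" for x, y in zip(a, b))


def adjacent(a, b):
    """Two state codes are adjacent if their Hamming distance is one."""
    return hamming(a, b) == 1
```

For example, the cubes "1*0" and "110" differ in one position (D = 1), but that position holds a don't care, so H = 0: the two cubes overlap and could not both be used as state codes in an implementable coding.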

The basic strategy for obtaining a set of relations among state
codes is the following. All pairs of symbolic implicants (A, B) are
examined and code distance requirements are enforced according to
the following rules. When Rule 1 applies, two symbolic implicants can
be coded and merged into one binary-valued logical implicant and the
cover cardinality can be reduced. Therefore Rule 1 is considered a "strong
rule" and it is highly desirable that the related code distance
requirements are satisfied. Rule 2 allows the number of literals to be
reduced and is considered a "weak rule" compared to Rule 1, because a
reduction in size of the PLA is considered more desirable than a
reduction of its complexity.

Let $A = [i_A, s_A, s'_A, o_A]$ be a symbolic implicant of the machine
cover. We define $S(A)$ as the set of states which are mapped by any
input representation $i \subseteq i_A$, either into a next-state different from $s'_A$ or
into an output representation not covered by $o_A$, or both.

Rule 1: Let $A = [i_A, s_A, s'_A, o_A]$, $B = [i_B, s_B, s'_B, o_B]$ be
two symbolic implicants such that $i_A \supseteq i_B$ and $o_A \supseteq o_B$.
Then:

$$c(s'_A) \supseteq c(s'_B)$$

and:

$$H(c(s_A) \cup c(s_B),\ c(s_q)) > 0 \quad \forall s_q \in S(A).$$

Rationale. A and B can be coded and merged into only one
logic implicant, namely:

$$[i_A,\ c(s_A) \cup c(s_B),\ c(s'_A),\ o_A].$$


Rule 1 requires two different conditions on state codes: i) a covering
relation between cubes $c(s'_A)$ and $c(s'_B)$ considered as output parts
of binary-valued implicants; ii) a distance relation which keeps state
codes $c(s_A)$ and $c(s_B)$ far from the codes of the states in $S(A)$.

Remark. If $s'_A = s'_B$ the covering requirement is automatically
satisfied. Moreover if only completely specified codes are
used (i.e. no don't care conditions are used in state codes),
then $D(c(s_A), c(s_B)) = 1$ implies that
$H(c(s_A) \cup c(s_B),\ c(s_q)) > 0$ $\forall s_q \ne s_A$ and $s_q \ne s_B$. This
requirement is equivalent to the column adjacency rule stated by
Armstrong in [1], when $i_A = i_B$ and $o_A = o_B$. Note that Rule 1
is far more general than Armstrong's rule.


Let $A = [i_A, s_A, s'_A, o_A]$ and $B = [i_B, s_B, s'_B, o_B]$ be two symbolic
implicants such that $s_A = s_B$ and $o_A = o_B$. We define $I(AB)$ as the set
of input representations $\{ i \subseteq i_A \cup i_B \}$ which map state $s_A$ either into a
next-state different from $s'_A$ or $s'_B$ or into an output representation
not covered by $o_A$, or both.

Rule 2: Let $A = [i_A, s_A, s'_A, o_A]$ and $B = [i_B, s_B, s'_B, o_B]$ be
two symbolic implicants such that $s_A = s_B$ and $o_A = o_B$. If
$I(AB) = \emptyset$, then:

$$H(c(s'_A),\ c(s'_B)) = 1.$$

Rationale. The corresponding logical implicants can be
reshaped [6] as:

$$[i_A \cup i_B,\ c(s_A),\ c(s'_A),\ o_A]$$
$$[i_B,\ c(s_B),\ c(s'_B) - c(s'_A),\ d]$$

where d is a string of "0"s and where, without loss of generality,
$c(s'_A)$ and $c(s'_B)$ are obtained by assigning cares to the
don't care entries of the next-state codes so that
$D(c(s'_A), c(s'_B)) = 1$ and the "1" count in $c(s'_B)$ is
larger than in $c(s'_A)$. Note that the second logical implicant
obtained by Rule 2 always has only one care in the output
part. Hence it may be covered by some other implicants of the
logical cover. Therefore when Rule 2 applies the number of
literals and possibly the logical cover cardinality are reduced.

Rule 2 requires an adjacency relation between cubes $c(s'_A)$ and
$c(s'_B)$.

Remark. If $D(i_A, i_B) = 1$, then $I(AB) = \emptyset$. Therefore, if
we restrict our attention to completely specified codes only,
$D(c(s'_A), c(s'_B)) = 1$ implies that
$H(c(s'_A), c(s'_B)) = 1$. This condition is equivalent to the
row adjacency rule presented by Armstrong in [1].


More complex rules can be derived by considering other relations
between symbolic implicants. In particular, Rule 2 can be generalized
to the case in which $o_A \supseteq o_B$ and Rule 1 can be modified to the case
in which $H(o_A, o_B) = 1$.

4. STATE ENCODING STRATEGIES

The rules stated in Section 3 give rise to relations among state codes
which can be grouped as follows:

1) code adjacency ($H(c(s_A), c(s_B)) = 1$);

2) code covering, i.e. requiring a next-state code to cover another
next-state code ($c(s'_A) \supseteq c(s'_B)$);

3) code distance, i.e. requiring the code of one state, say $s_q$, to be
far enough from the union of the codes of a pair of states $s_A$ and
$s_B$ ($H(c(s_A) \cup c(s_B),\ c(s_q)) > 0$).
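
Relations 1) and 3) can be checked mechanically once codes are represented as strings over "0", "1" and "*". The sketch below is illustrative and makes one assumption not spelled out in the text: the union of two cubes is taken as the smallest cube containing both (positions that differ become "*"). The covering relation 2) is omitted because it applies to codes read as output parts. All function names are hypothetical.

```python
def hamming(a, b):
    """H(a, b): differing positions where both entries are cares."""
    return sum(x != y and x != "*" and y != "*" for x, y in zip(a, b))


def union(a, b):
    """Smallest cube containing both a and b (assumed meaning of the union)."""
    return "".join(x if x == y else "*" for x, y in zip(a, b))


def adjacency_ok(code_a, code_b):
    """Relation 1): the two state codes must be adjacent."""
    return hamming(code_a, code_b) == 1


def distance_ok(code_a, code_b, code_q):
    """Relation 3): code_q stays outside the union of code_a and code_b."""
    return hamming(union(code_a, code_b), code_q) > 0
```

With two-bit codes, merging "00" and "01" gives the cube "0*", which is still disjoint from "11"; merging "00" and "11" gives "**", which swallows every other code and so violates any distance requirement.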

Relations 1) and 2) are represented by a mixed weighted graph
$G(V, E, W(E))$, where the set of nodes V is in one-to-one
correspondence with the set of states, and E consists of a set of directed and
undirected edges. The undirected edges are related to the adjacency
relations, i.e. $\{v_A, v_B\} \in E$ if $H(c(s_A), c(s_B)) = 1$; and the
directed edges are related to covering relations, i.e. $(v_A, v_B) \in E$ if
$H(c(s_A), c(s_B)) = 1$ and $c(s_A) \supseteq c(s_B)$. Weights are defined
according to the number of times the same distance requirement
occurs and to the related rule. Code distance requirements are
represented by a list structure. In particular
$H(c(s_A) \cup c(s_B),\ c(s_q)) > 0$ is represented by $s_q$ pointing to the
pair $s_A$ and $s_B$ in the list.

The problem of finding a state coding which satisfies the rules
given in Section 3 can now be seen as a graph embedding problem. Let
N be the dimension of a boolean hypercube B, representing the
possible codes. Let $P(B)$ be the set of all subcubes contained in B. We
have to determine the dimension N and an injective function
$c: V \to P(B)$ such that the relations induced by the rules presented in
Section 3 are satisfied. The adjacency relations are satisfied if

$$d_G(v_i, v_j) \ge H(c(v_i), c(v_j)) \quad \forall v_i, v_j \in V,\ v_i \ne v_j$$

where $d_G(v_i, v_j)$ is the length of the shortest path in the graph
between $v_i$ and $v_j$, and $H(c(v_i), c(v_j))$ denotes the Hamming distance
of the subcubes $c(v_i)$ and $c(v_j)$. Geometrically we would like to
determine, if possible, an isomorphism between the graph and a
squashed hypercube, i.e. a hypercube B in which some elements of
$P(B)$ collapse into a vertex. It can be proven that there always exists
an integer N such that an injective map $c: V \to P(B)$ satisfying the
above relations can be found. However, it is important not only to
reduce the number of product terms of the FSM combinational
component, but also to keep N as small as possible because N is
proportional to the number of columns required by a PLA implementation.
Three optimization strategies can be followed:

1) Set N to a fixed value and find c( ) such that the number of
code constraints violated by the encoding is minimized;
2) Find the smallest N such that all the rules are satisfied;
3) Trade off N and the number of constraints violated.

Strategy 1 is close to the one followed by Armstrong, where
$N = \lceil \log_2 |V| \rceil$, i.e. the minimum number of bits needed to encode the
states. Strategy 3 is the most desirable but obviously the most difficult
to implement. We decided to implement strategy 2 as an intermediate
step towards strategy 3. A first theoretical question to ask, when
implementing strategy 2, is whether a bound on N can be found.

If we require that:

$$d_G(v_i, v_j) = H(c(v_i), c(v_j)) \quad \forall v_i, v_j \in V,\ v_i \ne v_j$$

we have an isometric embedding of a graph into a squashed hypercube.
Graham showed in [11] that any graph $G(V, E)$ can be embedded into
a hypercube of dimension $N(G) = (|V| - 1)\,\mathrm{diam}(G)$, where $\mathrm{diam}(G)$ is
the diameter of G, and conjectured that the bound can be lowered to
$N(G) = |V| - 1$. The conjecture can be proven true for graphs
belonging to some special classes, e.g. complete graphs. The graph
embedding problem arising from our formulation is a distance-bounded graph
embedding. It can be reduced to an isometric embedding into a
squashed hypercube by appending appropriate edges to G. Therefore
there always exists a coding map c( ) satisfying the given requirements
having $N(G) \le |V| - 1$.
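
The code-length figures mentioned above are easy to compute for a given adjacency graph. The following sketch assumes a connected, undirected graph stored as a dictionary of neighbor lists; the helper names are hypothetical, and the "conjectured" entry simply evaluates the $|V| - 1$ expression quoted in the text.

```python
from collections import deque


def diameter(adj):
    """Longest shortest-path length in a connected undirected graph."""
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(adj):
            raise ValueError("graph must be connected")
        return max(dist.values())

    return max(eccentricity(v) for v in adj)


def code_length_bounds(adj):
    """Code lengths discussed in the text for a graph with |V| = len(adj) nodes."""
    n = len(adj)
    return {
        "minimum_bits": (n - 1).bit_length(),  # ceil(log2 |V|)
        "graham": (n - 1) * diameter(adj),     # (|V| - 1) * diam(G)
        "conjectured": n - 1,                  # |V| - 1
    }
```

For a four-state machine whose adjacency graph is a simple path, the diameter is 3, so the three figures are 2, 9 and 3 bits respectively.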

We present in Fig. 1 the flowchart of a heuristic algorithm for
distance-bounded graph embedding, which is reminiscent of the procedure
presented in [14] for isometric embedding. The algorithm tries to
minimize N and constructs a coding using $N \le |V| - 1$ bits. Note that
this is a worst-case upper bound and that the computed codes are
much shorter than $|V| - 1$ in many practical cases.

The algorithm applies to connected graphs. If $G(V, E, W(E))$ is
disconnected, its connected components are determined first and the
different groups of codes are packed together at the end. We deal here
with a connected graph for the sake of simplicity.

The algorithm visits each node $v_i$, $i = 1, \ldots, |V|$, of the graph and at the k-th step constructs a partial encoding of length N(k) for $v_k$. It appends one bit to the codes of the nodes $v_i$, $i = 1, \ldots, k$, only if the code length must be increased, as shown in Fig. 1. A degree of freedom of our procedure is the order of the selected nodes. We choose as first node the one which corresponds to the state with the maximum number of occurrences as next-state in the symbolic cover. We map it into the origin of the coordinates of the hypercube, to maximize the occurrence of "0"s in the output part of the coded implicants. The node selected at the k-th step, $v_k$, is adjacent to a coded vertex (i.e., adjacent to $v_j$, $j < k$) and has the maximum number of uncoded adjacent nodes. The rationale is that such a node has more constraints to satisfy and so it deserves higher priority in the space occupation on the hypercube.
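The node-selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `selection_order`, the dictionary-based graph representation, and the tie-breaking (first candidate found) are hypothetical and not taken from the program described in this report. The graph is assumed to be connected, as in the text.

```python
def selection_order(adj, next_state_count):
    """Return the order in which nodes would be selected for coding.

    adj: dict mapping each node to the set of its adjacent nodes
         (assumed to describe a connected graph).
    next_state_count: dict mapping each node to its number of
         occurrences as next-state in the symbolic cover.
    """
    # First node: the state that occurs most often as next-state.
    first = max(adj, key=lambda v: next_state_count[v])
    order = [first]
    coded = {first}
    while len(coded) < len(adj):
        # Candidates: uncoded nodes adjacent to at least one coded vertex.
        candidates = [v for v in adj if v not in coded and adj[v] & coded]
        # Prefer the candidate with the most uncoded adjacent nodes
        # (ties are broken arbitrarily; this is an assumption).
        nxt = max(candidates, key=lambda v: len(adj[v] - coded))
        order.append(nxt)
        coded.add(nxt)
    return order
```

The sketch covers only the visiting order; the search for an implementable code at each step is not modeled.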

At step k node $v_k$ is coded as follows. Assume we have assigned partial codings of code length N(k-1) to $v_i$, $i = 1, \ldots, k-1$, so that

$$H(c(v_i), c(v_j)) = 1$$

for all adjacent coded pairs $v_i$ and $v_j$. Then we search for an implementable coding $c(v_k)$ of the same length with the property that:

$$H(c(v_k), c(v_j)) = 1$$

for all adjacent coded pairs $v_k$ and $v_j$, under the constraint:

$$H(c(v_k), c(v_r) \cup c(v_s)) > 0$$

for all node pairs $v_r$, $v_s$ in the list pointed by $v_k$ and coded before step k. An exhaustive search for a feasible code would require a number of trials that grows exponentially with N(k-1). Therefore we test only a subset of the possible trials, which are called "slight modifications" of the coding of the vertices adjacent to $v_k$. A slight modification is obtained by complementing one care bit ("1" or "0") of the code of a vertex adjacent to $v_k$. There are at most $|V|^2$ such trials.

It is possible that no implementable coding for $v_k$ can be obtained by slight modifications. In this case the algorithm constructs the code of $v_k$ by appending a "1" to the string of bits obtained from the logical union of the codes of all the adjacent vertices and by appending a "1" or "0" or [...] to the code of each vertex $v_j$ coded before step k. In this way, we can always satisfy the distance requirements, but unfortunately at the expense of an increase in the code length. However, the algorithm will construct a valid encoding for G of length bounded by $N \le |V| - 1$. The computational complexity is O(|V|*) in the


Remark. Since a bound on the code length can be obtained by
bounding the number of vertices in each connected component
of G, we can partition the graph into components of bounded
size by removing a subset of edges. Edge weights can be used
to determine the optimal graph decomposition.


5. EXPERIMENTAL RESULTS AND CONCLUDING REMARKS

The algorithm has been implemented in an interactive computer program. The program reads the symbolic description of an FSM, generates the distance requirements, and determines state codes. The program has been tested on a set of industrial finite state machines. Results are reported in Table 1 and show that the algorithm is effective in generating state codes leading to an FSM implementation with a reduced number of product-terms in the combinational component. Execution times are on the order of some seconds on an IBM 3081 computer.

A new approach based on multiple-valued logic minimization is currently being pursued in collaboration with Dr. Brayton of IBM. Preliminary experimental results show that this method can be considered a breakthrough in FSM synthesis.

Abstract: Circuit simulation programs have proven to be most important computer-aided design tools for the analysis of the electrical performance of integrated circuits. One of the most common analyses performed by circuit simulators, and the most expensive in terms of computer time, is nonlinear time-domain transient analysis. Conventional circuit simulators were designed initially for the cost-effective analysis of circuits containing a few hundred transistors or less. Because of the need to verify the performance of larger circuits, many users have successfully simulated circuits containing thousands of transistors despite the cost.

Recently, a new class of algorithms has been applied to the electrical IC simulation problem. New simulators using these methods provide accurate waveform information with up to two orders of magnitude speed improvement for large circuits. These programs use relaxation methods for the solution of the set of ordinary differential equations which describe the circuit under analysis, rather than the direct sparse-matrix methods on which standard circuit simulators are based.

In this paper, the techniques used in relaxation-based electrical simulation are presented in a rigorous and unified framework, and the numerical properties of the various methods are explored. Both the advantages and the limitations of these techniques for the analysis of large ICs are described.

I.  Introduction 

Circuit simulation programs, such as SPICE2 [1] and ASTAP [2], have proven to be most important computer-aided design tools for the analysis of the electrical performance of integrated circuits (IC's). These programs can perform a variety of analyses, including dc, ac, and time-domain transient analysis of circuits containing a wide range of nonlinear active circuit devices such as MOSFETs and bipolar junction transistors [3].

One of the most common analyses performed by circuit simulators, and the most expensive in terms of computer time, is nonlinear time-domain transient analysis. By performing this analysis, precise electrical waveform information can be obtained if the device models and parasitics of the circuit are characterized accurately. However, conventional circuit simulators were designed initially for the cost-effective analysis of circuits containing a few hundred transistors or less. Because of the need to verify the performance of larger circuits, many users have successfully simulated circuits containing thousands of transistors despite the cost. For example, a 700 MOSFET circuit, analyzed for 4 μs of simulated time with an average 2-ns time step, takes approximately 4 CPU hours on a VAX 11/780 VMS computer with floating-point accelerator hardware.

Manuscript received May 25, 1983. This work was supported in part by DARPA under Contract N00030-K 025I, by JSEP under Contract

Gate-level logic simulators (e.g., [4], [5]) and switch-level simulators [6]-[8] can verify circuit function and provide first-order timing information more than three orders of magnitude faster than a detailed circuit simulator. However, to verify circuit performance for critical paths, memory design, and analog circuit blocks, it is often essential to perform accurate electrical simulation. In some companies the simulation of circuits containing many thousands of devices is performed routinely and at great expense. In recent years, considerable effort has been focused on techniques for improving the speed of time-domain electrical analysis while maintaining acceptable waveform accuracy.

A number of approaches have been used to improve the performance of conventional circuit simulators for the analysis of large circuits. The time required to evaluate complex device model equations has been reduced using table-lookup models [9]-[13]. Techniques based on special-purpose microcode have been investigated for reducing the time required to solve sparse linear systems arising from the linearization of the circuit equations [14]. Node-tearing techniques have also been used to exploit circuit regularity by bypassing the solution of subcircuits whose state is not changing [15], [16] and to exploit the vector-processing capabilities of high-performance computers such as the CRAY-1 [17]. In all cases, the overall speed improvement of the simulation has been at most an order of magnitude for practical circuits.

Recently, a new class of algorithms has been applied to the electrical IC simulation problem. New simulators using these methods provide waveforms that are as accurate as, or more accurate than, those of standard circuit simulators such as SPICE2 or ASTAP, with up to two orders of magnitude speed improvement for large circuits. These simulators have been used for the analysis of both digital and analog MOS IC's. They use relaxation methods for the solution of the set of ordinary differential equations (ODE's) which describe the circuit under analysis, rather than the direct sparse-matrix methods on which standard circuit simulators are based.

A broad survey of decomposition techniques for the simulation of large-scale integrated circuits can be found in [18]. In this paper, the techniques used in relaxation-based electrical simulation are presented in a rigorous and unified framework and the numerical properties of the various methods are explored. Both the advantages and the limitations of these techniques for the analysis of large IC's are described. In Section II, some of the fundamental problems associated with conventional circuit simulation algorithms as circuit size increases are exposed and the mathematical basis for the relaxation approach is introduced. In Section III, the special relaxation methods called timing simulation algorithms are described and their numerical properties are investigated. In Section IV, iterated timing analysis, which applies relaxation techniques at the nonlinear equation level [19], is described briefly and its convergence properties are proven. The waveform relaxation method [20], [21], which applies relaxation techniques at the differential equation level, is presented in Section V, and various techniques which can be used to improve its performance for electrical simulation are described. Concluding remarks and areas requiring further research are presented in Section VI.

II.  Circuit  Equation  Formulation  and 
Standard  Relaxation  Techniques 

A.  Equation  Formulation 

Before the techniques used in relaxation-based simulation are presented, the particular electrical simulation problem to be solved must be defined. Although relaxation-based methods can be used with a variety of technologies (e.g., [23]), they are particularly suited to the analysis of large MOS digital IC's, as will become clear later. Thus to help clarify the presentation, the following simplifying assumptions are made:

•  All resistive elements, including active devices, are
characterized by constitutive equations where voltages
are the controlling variables and currents are the controlled
variables.

•  All  energy  storage  elements  are  two-terminal,  possibly 
nonlinear,  voltage-controlled  capacitors. 

•  All  independent  voltage  sources  have  one  terminal 
connected  to  a  ground  or  can  be  transformed  into 
independent  current  sources  with  the  use  of  the  Nor¬ 
ton  transformation. 

Under  these  assumptions,  the  circuit  equations  can  be 
formulated  in  terms  of  a  nodal  analysis  that  yields  N 
equations  in  N  unknown  node  voltages  [24],  where  there 
are  N  + 1  nodes  in  the  circuit  and  node  N  + 1  is  the 
reference  node,  or  ground. 

An  important  assumption  required  by  relaxation-based 
electrical  simulators  is  that  a  two-terminal  capacitor  be 
connected  from  each  node  of  the  circuit  to  the  reference 
node.  This  assumption  is  satisfied  by  circuits  where  lumped 
parasitic  capacitances  are  present  between  circuit  intercon¬ 
nect  and  ground  or  on  the  terminals  of  active  circuit 
elements. 

Under these assumptions, the nodal equations can be written in the form

$$C(v(t),u(t))\,\dot{v}(t) = -f(v(t),u(t)), \quad 0 \le t \le T; \qquad v(0) = V \tag{1}$$

where $v(t) \in R^N$ is the vector of node voltages at time t; $\dot{v}(t) \in R^N$ is the vector of time derivatives of v(t); $u(t) \in R^r$ is the input vector at time t; $C(\cdot): R^N \times R^r \to R^{N \times N}$ represents the nodal capacitance matrix; $f: R^N \times R^r \to R^N$; and

$$f(v(t),u(t)) = [f_1(v(t),u(t)), f_2(v(t),u(t)), \ldots, f_N(v(t),u(t))]^T$$

where $f_i(v(t),u(t))$ is the sum of the currents charging the capacitors connected to node i. In the following sections (1) will be referred to in a simplified form where the time dependencies are expressed implicitly, i.e.,

$$C(v,u)\,\dot{v} = -f(v,u). \tag{2}$$

Fig. 1. Circuit simulator flow diagram for transient analysis.

B.  Standard  Circuit  Simulation 

A simplified flow diagram for the solution of these equations by a conventional circuit simulator is shown in Fig. 1. Once the circuit description has been read by the program and the data structures required for simulation have been assembled, the main analysis loop (Steps (2)-(13)) is entered.

At each new analysis time point, the information from previous time points is used to predict the solution at $t_{n+1}$. Stiffly stable integration formulas, such as Backward Euler (BE), the Trapezoidal Rule (TR), or Gear's Variable-Order Method (GE), with variable time steps, are used to discretize (1) at Steps (4) and (5) [3]. This process yields a set of nonlinear, algebraic difference equations of the form

$$g(x) = 0 \tag{3}$$

where $x \in R^N$ is the vector of node voltages at time $t_{n+1}$.

These equations are solved using a damped Newton-Raphson algorithm, each iteration of which yields a set of sparse linear equations of the form

$$Ax = b \tag{4}$$

where $A \in R^{N \times N}$ is a matrix related to the Jacobian of g and $b \in R^N$ [3]. Typically, less than 2 percent of the entries of A are nonzero for N > 500. These equations are then solved using direct methods, such as sparse LU decomposition or Gaussian elimination, Steps (7) and (8).

Steps (5)-(9) are repeated until the Newton-Raphson process converges or the upper bound on the number of iterations is reached. The program then decides whether to accept the solution, based on its estimate of local truncation error (LTE) and the number of Newton-Raphson iterations required in Steps (5)-(9). A new time step is computed, and Steps (2)-(13) are repeated until the simulation is complete [3].
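As a rough illustration of the discretize-and-solve loop described above, the sketch below applies Backward Euler to a single-node instance of (2), $C \dot{v} = -f(v)$, and solves the resulting scalar equation g(x) = 0 by Newton-Raphson. It is a minimal sketch, not the flow of any particular simulator: the function names, the fixed time step, the undamped Newton update, and the absence of LTE-based step control are all simplifying assumptions.

```python
def backward_euler_step(v_n, h, C, f, dfdv, tol=1e-12, max_iter=50):
    """Advance one time step of C * dv/dt = -f(v) with Backward Euler.

    The discretized equation is g(x) = C * (x - v_n) / h + f(x) = 0,
    solved for x (the voltage at the new time point) by Newton-Raphson.
    """
    x = v_n  # predictor: start from the previous time point
    for _ in range(max_iter):
        g = C * (x - v_n) / h + f(x)
        dg = C / h + dfdv(x)      # derivative of g with respect to x
        dx = -g / dg              # Newton-Raphson update
        x += dx
        if abs(dx) < tol:
            return x
    raise RuntimeError("Newton-Raphson did not converge")


def transient(v0, h, steps, C, f, dfdv):
    """Fixed-step transient analysis; returns the list of node voltages."""
    v = [v0]
    for _ in range(steps):
        v.append(backward_euler_step(v[-1], h, C, f, dfdv))
    return v


# Usage: a linear RC discharge with C = 1 and R = 1, so f(v) = v / R.
waveform = transient(1.0, 0.1, 10, 1.0, lambda v: v, lambda v: 1.0)
```

For the linear example each step multiplies the voltage by 1/(1 + h/RC), which gives a simple check on the discretization.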

This procedure has proven to be reliable and accurate. For large circuits, however, the process can take a considerable amount of computer time, as illustrated in Section I. The majority of the time spent in Steps (2)-(13) can be lumped into two categories: the time required to solve the system of sparse linear equations, called "solve" (Steps (7) and (8)), and the time required to form the entries of A and b in (4), called "form" (Steps (5) and (6)).

Fig. 2 shows the amount of CPU time required to perform a transient analysis of a set of typical circuits of increasing size. For this example, the number of circuit nodes N is used as a measure of circuit size. The time required for equation preprocessing is not included here; only time involved in the actual time-domain transient portion of the simulation is shown. A simple RC circuit was chosen for this example to emphasize the increasing cost of matrix solution time. The example was constructed by calling an increasing number of cells, in a hierarchical manner, each with the same matrix structure and an average number of fanouts between 2.5 and 3. This approach preserved the observed properties of most real circuits while providing a uniform technique for increasing circuit size.

As can be seen in Fig. 2, for small circuits (N < 20) the majority of the solution time is spent performing form. However, as the size of the circuit grows, an increasing percentage of the time is spent in the solve phase. While the actual percentages may vary depending on the circuit under analysis, the complexity of the nonlinear device models used by the program, and the computer on which the simulator is running, this trend is true for all standard circuit simulators running on conventional computers. For MOS circuits analyzed on a VAX 11/780 UNIX computer, the crossover point is at around 500 nodes. The time spent in the equation solution phase has been measured to grow as $O(N^{\beta})$, where $1.1 < \beta < 1.5$. In particular, for large circuits $\beta$ has been found to depend on the difference between the time required to perform arithmetic operations and the memory bandwidth of the computer. On the other hand, the time required for form grows linearly with the number of circuit elements and, therefore, with the number of circuit equations for typical circuits. The time spent in the form (load) phase can be reduced by simplifying the device model equations, using table-lookup models [9]-[13], or providing special-purpose instructions to update A and b [14].

For most circuits the fraction of nodes which are changing their voltage value at a given point in time decreases as the circuit size increases. For circuits containing over 500 MOSFETs, fewer than 20 percent of the node voltages change significantly over a simulation time step. Only the circuit equations representing these active nodes must be solved at any time. Circuit simulators exploit this time sparsity or latency by using device-level [1] or block-level [16], [25], [17] bypass schemes. In a device-level bypass scheme, if the terminal voltages and branch currents of a circuit element did not change significantly in the previous Newton-Raphson iteration, its contributions to A and b in (4) are not reevaluated, and the values computed during the previous iteration are used. In block-level bypass, both the matrix element evaluation and the node solution steps are bypassed for each block of inactive connected circuit elements. While the aforementioned techniques do reduce the total execution time for conventional circuit simulators, the savings are often not sufficient for the cost-effective electrical simulation of LSI circuits.

Fig. 2. Transient analysis time for circuits of increasing size.

C.  Linear  Relaxation  Methods 

Relaxation  methods  can  be  used  for  the  solution  of  (1)  in 
a  number  of  ways.  In  all  cases,  their  principal  advantages 
stem  from  the  fact  that  they  do  not  require  the  direct 
solution  of  a  large  system  of  linear  equations  and  from  the 
fact  that  they  permit  the  simulator  to  exploit  latency 
efficiently. 

Relaxation methods can be applied at different stages in the solution of (1), as illustrated in Fig. 3. The two most common methods used in electrical simulation are the Gauss-Jacobi method and the Gauss-Seidel method [26], [27].

For the solution of the linear equations, relaxation methods can replace direct methods for the solution of (4). Let A be split into L + D + U, where $L \in R^{N \times N}$ is strictly lower triangular, $D \in R^{N \times N}$ is diagonal, and $U \in R^{N \times N}$ is strictly upper triangular. Then the two methods mentioned earlier have the following form when applied to the solution of (4).

Gauss-Jacobi:

$$D x^{k+1} = -(L + U) x^k + b \tag{5a}$$

or

$$x^{k+1} = -D^{-1}((L + U) x^k - b) = M_{GJ} x^k + D^{-1} b \tag{5b}$$

Gauss-Seidel:

$$(L + D) x^{k+1} = -U x^k + b \tag{6a}$$

or

$$x^{k+1} = -(L + D)^{-1}(U x^k - b) = M_{GS} x^k + (L + D)^{-1} b \tag{6b}$$

where $x^k$ is the value of x at the k-th iteration.

Fig. 3. Parallel between standard circuit simulation techniques and relaxation-based techniques. (Panel labels: Relaxation-Based Circuit Simulation; Standard Circuit Simulation; Integration Formula (e.g., Backward Euler); Nonlinear Gauss-Seidel; Newton-Raphson; Linear Gauss-Seidel; Gaussian Elimination or LU Decomposition.)
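The two iterations (5) and (6) can be sketched in a few lines. The code below is an illustrative sketch only: it uses dense Python lists for readability, so each sweep costs O(N^2) rather than the O(N) attainable with sparse storage, and the function names, tolerance, and stopping test are assumptions rather than details taken from any simulator.

```python
def gauss_jacobi_step(A, b, x):
    """One Gauss-Jacobi sweep: every component uses only the old iterate x."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]


def gauss_seidel_step(A, b, x):
    """One Gauss-Seidel sweep: updated components are used immediately."""
    n = len(b)
    x = list(x)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x


def relax(step, A, b, x0, tol=1e-10, max_iter=1000):
    """Iterate the given sweep until successive iterates differ by < tol."""
    x = list(x0)
    for k in range(1, max_iter + 1):
        x_new = step(A, b, x)
        if max(abs(p - q) for p, q in zip(x_new, x)) < tol:
            return x_new, k
        x = x_new
    raise RuntimeError("relaxation did not converge")


# Usage on a small strictly diagonally dominant system with solution [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x_gj, sweeps_gj = relax(gauss_jacobi_step, A, b, [0.0, 0.0, 0.0])
x_gs, sweeps_gs = relax(gauss_seidel_step, A, b, [0.0, 0.0, 0.0])
```

Both functions divide by the diagonal entry A[i][i], which is why the iterations are undefined when D is singular.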

Since relaxation methods are iterative methods, it is important to ask under what conditions they are guaranteed to converge to the solution of (4).

Note that the iterations are not well defined if D is singular, that is, if there is a zero on the main diagonal of A. It is well known that a necessary and sufficient condition for the iterations defined by (5b) and (6b) to converge to the solution of (4), independent of the initial guess $x^0$, is that the eigenvalues of $M_{GJ}$ and $M_{GS}$, respectively, be inside the unit circle in the complex plane [26]. However, this condition is not practical from a computational point of view and other conditions, in general sufficient conditions, are used to check the convergence of these methods. In particular, it can be shown that if A is strictly diagonally dominant, then both the Gauss-Jacobi and the Gauss-Seidel iterations converge to the solution of (4). Other sufficient conditions can be found in [26], [28].
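The sufficient condition of strict diagonal dominance is cheap to test, in contrast to the eigenvalue condition. A minimal sketch of the row-wise check follows; the function name is hypothetical and a dense list-of-lists matrix is assumed for simplicity.

```python
def is_strictly_diagonally_dominant(A):
    """True if |a_ii| > sum over j != i of |a_ij| holds for every row i."""
    return all(
        abs(row[i]) > sum(abs(a) for j, a in enumerate(row) if j != i)
        for i, row in enumerate(A)
    )


# Usage: the first matrix passes the test, the second does not.
dominant = is_strictly_diagonally_dominant([[4, 1, 0], [1, 4, 1], [0, 1, 4]])
not_dominant = is_strictly_diagonally_dominant([[1, 2], [3, 4]])
```

A zero on the diagonal always fails the test, consistent with the requirement that D be nonsingular. Since the condition is only sufficient, a matrix that fails it may still yield convergent iterations.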

Another important convergence property of iterative methods is rate of convergence. It can be shown that if the Gauss-Jacobi and the Gauss-Seidel iterations converge, they converge at least linearly. That is, after a sufficiently large number of iterations, the error at each iteration decreases according to

$$\|x^{k+1} - \hat{x}\| \le \epsilon \|x^k - \hat{x}\|$$

for some constant $\epsilon < 1$, where $\hat{x}$ is the solution of (4).

The computational cost of both of these methods is O(N) per iteration, compared with $O(N^{1.1})$ to $O(N^{1.5})$ for direct, sparse-matrix techniques. Thus relaxation methods are advantageous from
a  computational  point  of  view  with  respect  to  sparse- 
matrix  techniques  only  if  the  number  of  iterations  needed 
to  obtain  convergence  is  of  the  order  of  N°  *.  In  addition. 


sparse-matrix  techniques  are  based  on  Gaussian  elimina¬ 
tion  or  LU  decomposition  and,  if  exact  arithmetic  is  used, 
they  obtain  the  exact  solution  of  (4)  in  one  step.  Relaxation 
techniques,  as  mentioned  previously,  are  not  guaranteed  to 
converge.  Reliability  is  the  basic  reason  why  sparse-matrix 
techniques  have  been  used  more  frequently  than  relaxation 
techniques  in  conventional  circuit  simulators.  If  the 
Gauss-Seidel  method  is  used,  reordering  of  the  equations 
has  an  effect  on  the  number  of  iterations  needed  to  obtain 
a  solution  of  (4).  For  example,  if  A  is  upper  triangular,  N 
iterations  are  needed  to  obtain  the  exact  solution  of  (4). 
However,  if  A  is  reordered  into  lower  triangular  form,  the 
solution  of  (4)  is  obtained  in  a  single  iteration.  If  the 
Gauss-Jacobi  iteration  is  used,  reordering  of  the  equations 
has  no  effect  on  the  speed  of  the  algorithm. 
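The effect of equation ordering on Gauss-Seidel can be checked on a small triangular system. The sketch below is illustrative only: it uses exact rational arithmetic so that "exact solution" can be tested literally, and the function name and the 4 x 4 example are assumptions chosen for the demonstration.

```python
from fractions import Fraction


def gauss_seidel_sweeps(A, b, max_sweeps=100):
    """Count forward Gauss-Seidel sweeps until A x = b holds exactly."""
    n = len(b)
    x = [Fraction(0)] * n
    for sweep in range(1, max_sweeps + 1):
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
        # Exact residual test, possible because Fractions are used.
        if all(sum(A[i][j] * x[j] for j in range(n)) == b[i] for i in range(n)):
            return sweep
    raise RuntimeError("no exact solution within max_sweeps")


# Upper triangular system with solution [1, 1, 1, 1].
upper = [[2, 1, 1, 1], [0, 2, 1, 1], [0, 0, 2, 1], [0, 0, 0, 2]]
b_upper = [5, 4, 3, 2]

# Reversing the order of both equations and unknowns gives a lower
# triangular system describing the same problem.
n = len(upper)
lower = [[upper[n - 1 - i][n - 1 - j] for j in range(n)] for i in range(n)]
b_lower = b_upper[::-1]

sweeps_upper = gauss_seidel_sweeps(upper, b_upper)  # N sweeps (here 4)
sweeps_lower = gauss_seidel_sweeps(lower, b_lower)  # a single sweep
```

In the upper triangular ordering each sweep fixes one more unknown, starting from the last; in the lower triangular ordering a single sweep is equivalent to forward substitution.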

The Gauss-Seidel method can be shown to converge faster than the Gauss-Jacobi method on a class of problems [26] (see footnote 1). For example, if A is lower triangular, Gauss-Seidel converges to the exact solution of (4) in one iteration while Gauss-Jacobi converges in N iterations. However, the fact that at each iteration each $x_i^{k+1}$, $i = 1, \ldots, N$, does not depend on any $x_j^{k+1}$, $j = 1, \ldots, N$; $j \ne i$, in the Gauss-Jacobi method means that the computation of all $x_i^{k+1}$, $i = 1, \ldots, N$, can proceed in parallel. This method is, therefore, well suited to modern multiprocessor computers.

D.  Nonlinear  Relaxation  Methods 

Relaxation methods can also be used at the nonlinear-equation solution level to augment the Newton-Raphson method, and hence replace the linear-equation solution based on sparse-matrix techniques. Let $x^k$ denote the value of x at the k-th iteration. The Gauss-Jacobi and Gauss-Seidel algorithms when applied to (3) have the following form:

Nonlinear Gauss-Jacobi Algorithm:

repeat { forall (j in N) {
    solve g_j(x_1^k, ..., x_j^{k+1}, ..., x_N^k) = 0 for x_j^{k+1}; } }
until (||x^{k+1} - x^k|| <= epsilon)    (7)

that is, until convergence is obtained. The forall (j in J) construct specifies that the computations for all values of j in the set J may proceed concurrently, i.e., in parallel and in any order.

Nonlinear Gauss-Seidel Algorithm:

repeat { foreach (j in N) {
    solve g_j(x_1^{k+1}, ..., x_j^{k+1}, x_{j+1}^k, ..., x_N^k) = 0 for x_j^{k+1}; } }
until (||x^{k+1} - x^k|| <= epsilon)    (8)

The foreach (j in J) construct specifies that the computations for each value of j in the ordered set J must proceed sequentially and in the order specified by the set. For this method the actual order in which the node equations are solved may be determined either statically or dynamically, as described later in Subsection III-B.

Footnote 1: Note that examples can be found where Gauss-Jacobi converges faster than Gauss-Seidel.

The  nonlinear  Gauss-Jacobi  and  Gauss-Seidel  itera¬ 
tions  are  well  defined  only  if  each  equation  described  in  (7) 
and  (8)  has  a  unique  solution  in  some  domain  under 
consideration.  In  the  linear  case,  the  iterations  were  well 
defined  if  D  was  nonsingular.  In  the  nonlinear  case  we 
have  a  similar  condition.  In  addition,  the  conditions  under 
which  these  methods  converge  are  also  analogous  to  the 
ones  given  for  the  linear  case. 

Let g'(x) denote the Jacobian of g computed at x. Let g be continuously differentiable in an open neighborhood $S_0$ of $\hat{x}$ for which $g(\hat{x}) = 0$. Let $g'(\hat{x})$ be split as $L(\hat{x}) + D(\hat{x}) + U(\hat{x})$ where $L(\hat{x})$, $D(\hat{x})$, and $U(\hat{x})$ are, respectively, the strictly lower triangular part, the diagonal part, and the strictly upper triangular part of $g'(\hat{x})$. Let $M_{GJ}(\hat{x})$ and $M_{GS}(\hat{x})$ be defined as follows:

$$M_{GJ}(\hat{x}) = -D(\hat{x})^{-1}(L(\hat{x}) + U(\hat{x})) \tag{9}$$

and

$$M_{GS}(\hat{x}) = -(L(\hat{x}) + D(\hat{x}))^{-1} U(\hat{x}). \tag{10}$$

Assume that $D(\hat{x})$ is nonsingular and that all the eigenvalues of $M_{GJ}(\hat{x})$ and $M_{GS}(\hat{x})$ are inside the unit circle. Then there exists an open ball $S \subset S_0$ such that the nonlinear Gauss-Jacobi and the Gauss-Seidel iterations are well defined and for any $x^0 \in S$, the sequence generated by the iterations converges to $\hat{x}$.

This result assumes that (7) and (8) can be solved exactly. Since these equations are nonlinear, there is no hope of computing the solutions exactly in finite time. Therefore an iterative method must be used. In general, the Newton-Raphson method is used to solve these equations. Note that for each relaxation iteration, N decoupled equations, each in one unknown, must be solved. Thus the implementation of the Newton-Raphson method is straightforward. These "composite" methods are called the Gauss-Jacobi-Newton and Gauss-Seidel-Newton methods to specify that the Newton iteration is performed inside the nonlinear Gauss-Jacobi and Gauss-Seidel iterations, respectively [27].

It is important to determine when to stop the iteration of the "inner" Newton-Raphson loop to achieve the same convergence as in the ideal case when the solutions of (7) and (8) are computed exactly. It turns out, rather surprisingly, that only one iteration of the Newton method on (7a) and (8a) is sufficient to preserve the convergence properties of the nonlinear relaxation methods [27]. In particular, the rate of convergence of the nonlinear Gauss-Seidel method is the same as the rate of convergence of the Gauss-Seidel-Newton method (see footnote 2).

Footnote 2: Note that rate of convergence is an asymptotic measure of the speed of the algorithm, i.e., of the number of iterations needed to achieve a given accuracy. Performing additional iterations of the inner Newton-Raphson loop may make the outer relaxation loop converge in fewer iterations, in some cases.
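A Gauss-Seidel-Newton iteration with a single inner Newton step per equation can be sketched as follows. This is a minimal sketch under stated assumptions: the callback interface (`g`, `dg_diag`), the stopping test on the largest update, and the two-equation example are all hypothetical and chosen only to make the idea concrete.

```python
def gauss_seidel_newton(g, dg_diag, x0, tol=1e-10, max_iter=500):
    """Nonlinear Gauss-Seidel with one Newton-Raphson step per equation.

    g(i, x) returns g_i(x); dg_diag(i, x) returns the partial derivative
    of g_i with respect to x_i, evaluated at x.
    """
    x = list(x0)
    for k in range(1, max_iter + 1):
        change = 0.0
        for i in range(len(x)):
            # One Newton step on the scalar equation g_i = 0 in the unknown
            # x_i, with all other components held at their latest values.
            dx = -g(i, x) / dg_diag(i, x)
            x[i] += dx
            change = max(change, abs(dx))
        if change < tol:
            return x, k
    raise RuntimeError("Gauss-Seidel-Newton did not converge")


# Example system with solution (1, 1):
#   g_0 = 4*x0 + x0**3 - x1 - 4 = 0
#   g_1 = -x0 + 4*x1 + x1**3 - 4 = 0
def g(i, x):
    other = x[1 - i]
    return 4.0 * x[i] + x[i] ** 3 - other - 4.0


def dg_diag(i, x):
    return 4.0 + 3.0 * x[i] ** 2


solution, sweeps = gauss_seidel_newton(g, dg_diag, [0.0, 0.0])
```

Only the diagonal partial derivatives are evaluated, so no Jacobian matrix is assembled and no simultaneous linear system is solved.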


Note that the convergence result presented above is local in the sense that the iterations are guaranteed to converge only if the initial guess is sufficiently close to a solution. In this respect, the convergence properties of relaxation methods are similar to the ones of the Newton-Raphson method. However, the eigenvalue condition of relaxation methods is much stronger than the other conditions of Newton-Raphson methods. Moreover, the rate of convergence of relaxation methods is only linear while it is quadratic for Newton-Raphson methods. This explains why Newton-Raphson methods are preferred in standard circuit simulation. However, each iteration of a relaxation method involves a set of decoupled equations while Newton-Raphson methods require the solution of a set of simultaneous equations. In addition, relaxation methods are ideally suited to exploit the latency of the circuit under analysis as described in the following sections.

A  comparison  can  be  made  of  the  use  of  relaxation 
methods  at  the  linear  and  nonlinear  equation  level.  If  the 
relaxation  methods  are  applied  at  the  linear  equation  level 
and  the  iteration  of  the  inner  relaxation  loop  is  carried  to 
convergence,  then  the  convergence  of  the  Newton  methods 
is  not  affected.  However,  if  the  inner  loop  is  not  carried  to 
convergence,  but  a  fixed  number  of  iterations  is  allowed, 
then  the  convergence  of  the  outer  Newton  loop  is  affected. 
In  fact,  if  only  one  iteration  of  the  inner  relaxation  loop  is 
taken,  then  the  convergence  of  the  “Newton-Gauss” 
methods  is  only  linear.  If  more  iterations  are  taken,  then 
the  rate  of  convergence  asymptotically  improves  to  be 
quadratic [27]. The use of relaxation at the linear equation
level involves the computation of the Jacobian of g, which
is quite expensive, as mentioned earlier. Nonlinear relaxation
methods coupled with an inner Newton-Raphson loop need only
the partial derivative of $g_i$ with respect to $x_i$,
resulting in a considerable saving of computer time per
iteration.
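
The nonlinear relaxation scheme described above can be sketched as follows. This is a minimal illustration and not code from any simulator discussed here: the two-equation system `g`, its diagonal derivatives `dg_diag`, and the name `gauss_seidel_newton` are assumptions, chosen so that the coupling between the equations is weak. Only the partial derivative of each equation with respect to its own unknown is evaluated; the full Jacobian is never formed.

```python
def gauss_seidel_newton(g, dg_diag, x0, sweeps=30, newton_iters=3):
    """Nonlinear Gauss-Seidel relaxation with an inner scalar Newton loop.

    g[i](x) evaluates equation i; dg_diag[i](x) evaluates d g_i / d x_i.
    """
    x = list(x0)
    for _ in range(sweeps):                 # outer relaxation loop
        for i in range(len(x)):             # equations are solved one at a time
            for _ in range(newton_iters):   # inner Newton-Raphson loop on x[i] only
                x[i] -= g[i](x) / dg_diag[i](x)
    return x


# Hypothetical, weakly coupled test system g(x) = 0.
g = [
    lambda x: 2.0 * x[0] + x[0] ** 3 + 0.2 * x[1] - 1.0,
    lambda x: 0.2 * x[0] + 3.0 * x[1] + x[1] ** 3 - 2.0,
]
dg_diag = [
    lambda x: 2.0 + 3.0 * x[0] ** 2,
    lambda x: 3.0 + 3.0 * x[1] ** 2,
]

solution = gauss_seidel_newton(g, dg_diag, [0.0, 0.0])
residual = max(abs(gi(solution)) for gi in g)
print(solution, residual)
```

With stronger off-diagonal coupling the same loop may need many more sweeps or may fail to converge, which is the eigenvalue condition mentioned above.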

As  in  the  linear  case,  the  Gauss-Seidel  method  tends  to 
converge  faster  than  Gauss-Jacobi.  Reordering  of  the 
equations  affects  the  speed  of  the  Gauss-Seidel  method 
crucially.  In  this  task,  the  dependency  matrix  of  (3)  plays  an 
important role. The dependency matrix is defined to be a
zero-one matrix $P = [p_{ij}]$ such that $p_{ij} = 1$ if $g_i$
depends on $x_j$, and $p_{ij} = 0$ otherwise. Note that P also
represents the zero-nonzero structure of the Jacobian of g.

If  P  is  lower  triangular,  then  only  one  iteration  of  the 
outer  Gauss-Seidel  relaxation  loop  is  needed,  provided 
that the inner Newton-Raphson loop is run to convergence.
If P is not lower triangular, but the dependency of the $g_i$
component of g on $x_j$, $j > i$, is “weak,” then the
Gauss-Seidel method converges rather quickly. Hence a key
issue  in  applying  relaxation  techniques  to  the  solution  of 
circuit  equations  is  the  reordering  of  the  equations  so  that 
P  is  almost  lower  triangular.  This  task  can  be  performed 
both  statically  and  dynamically,  as  described  in  the  next 
section. MOS devices are almost unidirectional from gate to
drain and from gate to source, because the gate is electrically
decoupled from the source and drain of the device. If, in
addition, all capacitors used in the simulation have one node
tied to ground and the circuit contains neither MOS
transmission gates nor feedback connections, then the
equations can be reordered statically to yield a lower
triangular P. This property provides an intuitive explanation
as to why relaxation methods are successful for the simulation
of MOS digital circuits.
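
The static reordering described above amounts to a topological sort of the dependency matrix. The sketch below is illustrative only: the function names `lower_triangular_order` and `permute` and the three-node matrix are assumptions, and the routine simply reports failure when feedback makes a lower triangular form impossible.

```python
def lower_triangular_order(P):
    """Return an equation ordering that makes P lower triangular, or None.

    P[i][j] == 1 means that equation g_i depends on unknown x_j.
    """
    n = len(P)
    # Off-diagonal dependencies still waiting to be placed in the ordering.
    pending = {i: {j for j in range(n) if j != i and P[i][j]} for i in range(n)}
    order = []
    while pending:
        ready = sorted(i for i, deps in pending.items() if not deps)
        if not ready:
            return None  # a dependency cycle (feedback) blocks any such ordering
        for i in ready:
            order.append(i)
            del pending[i]
        for deps in pending.values():
            deps.difference_update(ready)
    return order


def permute(P, order):
    """Apply the same permutation to the rows and columns of P."""
    return [[P[i][j] for j in order] for i in order]


# Hypothetical three-node example: g_0 depends on x_2, g_1 depends on x_0.
P = [[1, 0, 1],
     [1, 1, 0],
     [0, 0, 1]]
order = lower_triangular_order(P)
print(order)
print(permute(P, order))
```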

E. Conclusions

To  conclude  this  preliminary  section,  note  that  in  Fig.  3 
there  is  a  “hole”  in  the  relaxation  counterpart  of  the  flow 
diagram  of  standard  circuit  simulation  at  the  differential 
equation  level.  Until  recently,  relaxation  techniques  had 
been  used  only  at  the  linear  and  nonlinear  equation  levels. 
The  waveform  relaxation  method,  presented  in  Section  V, 
fills  that  gap. 

III.  Timing  Simulation 

A. Introduction

The  first  successful  application  of  relaxation  methods  to 
electrical-circuit analysis was in timing simulation [9]-[12].
In timing simulation, only one relaxation iteration is performed
per time step, while one or more Newton-Raphson iterations may
be performed to solve each nodal equation.
Since  the  relaxation  loop  is  not  taken  to  convergence,  a 
small  time  step  must  be  used  to  bound  local  errors  and  the 
saturating  properties  of  digital  MOS  circuits  are  exploited 
to  bound  error  propagation.  Timing  simulators  have  proved 
successful  when  applied  to  constrained  IC  design  methods, 
such as standard cell [31] or gate array, but have not been
as successful in the custom-design environment. Since there
is no way to guarantee accuracy for an arbitrary connection
of MOSFETs unless at least two relaxation iterations are
performed per time step, timing simulators have produced
incorrect results in some situations. A circuit designer
will use a program that gives the correct simulation result
and occasionally gives no result (e.g., no convergence at a
time point), but soon loses confidence in a program that
occasionally gives an incorrect answer!
Many timing simulators that were developed in-house in
industry are no longer in use, although where they do
remain in use, they continue to be very successful. When
used  correctly,  timing  simulators  can  provide  over  two 
orders  of  magnitude  speed  improvement  over  conventional 
circuit  simulators  for  comparable  waveform  accuracy. 

As described in detail later, timing simulation has problems
analyzing circuits containing tight feedback loops, pass
transistors, or floating elements.⁵ In particular, floating
capacitors  are  not  handled  satisfactorily.  Early  timing 
simulators  avoided  the  problem  of  analyzing  circuits  with 
floating  capacitors  by  not  allowing  the  user  to  include 
them  in  the  circuit  description.  Hence,  it  is  assumed  here 
that  the  nodal  capacitance  matrix  is  diagonal,  that  it  is 
nonsingular  for  the  entire  range  of  node  voltages  of  interest 
(this  implies  nonzero  grounded  capacitances),  and  that  the 

⁵ A floating element is a two-terminal capacitor or resistor whose
terminals are not connected to ground or to a power supply.


circuit equations are written as

$$\dot{v} = -C(v,u)^{-1} f(v,u) = -F(v,u). \tag{11}$$

Algorithms used for timing analysis often discretize the
derivative operator by Backward Euler [9], [10], [30] or
the Trapezoidal Rule [29]. For the sake of simplicity, in the
following description the Backward Euler formula will be
used:

$$\dot{v}_{k+1} = \frac{v_{k+1} - v_k}{h}$$

where the time step $h = t_{k+1} - t_k$, and $v_{k+1}$ and $v_k$ are
the computed values of the node voltages at times $t_{k+1}$ and
$t_k$, respectively. The solution of the resulting nonlinear
system of equations

$$v_{k+1} - v_k + hF(v_{k+1}, u(t_{k+1})) = 0 \tag{12}$$

is then approximated by one sweep of a relaxation technique.

Program MOTIS [9] used a modified Gauss-Jacobi technique
which yields the following set of decoupled equations:

$$v^n_{k+1} = v^n_k - hF_n(v^1_k, \ldots, v^{n-1}_k, v^n_{k+1}, v^{n+1}_k, \ldots, v^N_k, u(t_{k+1})), \quad n = 1, 2, \ldots, N. \tag{13}$$

The solution of the decoupled nonlinear equations of (6)
is then approximated by taking a single step of a regula
falsi iteration [32].

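The MOTIS-C [10] and SPLICE1 [29] programs use a
modified Gauss-Seidel technique. In SPLICE this technique
yields

$$v^n_{k+1} = v^n_k - hF_n(v_{k+1,n}, u(t_{k+1})), \quad n = 1, 2, \ldots, N \tag{14}$$

where

$$v_{k+1,n} = [v^1_{k+1}, \ldots, v^n_{k+1}, v^{n+1}_k, \ldots, v^N_k]^T. \tag{15}$$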


The solution of (14) is then approximated by using one or
more steps of the Newton-Raphson algorithm. The program
can cut the time step locally at a node and use a
number  of  small  time  steps  to  achieve  a  satisfactory  solu¬ 
tion  before  the  next  equation  in  the  relaxation  solution  is 
to  be  processed. 
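
A one-sweep Gauss-Seidel time step of the kind used in timing simulation can be sketched as below. This is a hedged illustration and not the SPLICE or MOTIS-C implementation: the function `timing_step`, the finite-difference derivative, and the two-stage chain `F` are assumptions made for the example, and local time-step cutting is omitted.

```python
def timing_step(F, v, u_next, h, newton_steps=1, eps=1e-6):
    """Advance all node voltages by one time step with a single Gauss-Seidel sweep.

    F[n](v, u) evaluates the n-th component of F in dv/dt = -F(v, u).
    """
    v_new = list(v)                      # entries are overwritten as the sweep proceeds
    for n in range(len(v)):
        def phi(x):
            # Backward Euler residual of node n with all other nodes held fixed.
            trial = v_new[:n] + [x] + v_new[n + 1:]
            return x - v[n] + h * F[n](trial, u_next)

        x = v[n]                         # initial guess: value at the previous time point
        for _ in range(newton_steps):
            slope = (phi(x + eps) - phi(x)) / eps   # finite-difference derivative
            x -= phi(x) / slope
        v_new[n] = x
    return v_new


# Hypothetical two-stage unidirectional chain with unit conductances and capacitors.
F = [
    lambda v, u: v[0] - u,        # node 0 is driven by the source u
    lambda v, u: v[1] - v[0],     # node 1 is driven by node 0
]
v1 = timing_step(F, [0.0, 0.0], u_next=1.0, h=0.1)
print(v1)
```

Because node 1 is processed after node 0, it already sees the new value of node 0 within the same sweep, which is what distinguishes this update from the Gauss-Jacobi form.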


B.  Network  Ordering 

Unless some form of connection graph is used to establish
a precedence order for signal flow, the new node voltages
will be computed in an arbitrary order. As pointed out in
Section II, in a Gauss-Jacobi-based simulator, where only
node voltages at $t_k$ are used to evaluate the node voltages
at $t_{k+1}$, the order of processing elements will not affect
the results of the analysis. However, substantial timing
errors may occur. For example, consider the inverter chain
of Fig. 4. If the input to inverter $I_1$ changes at time
$t_k$, that change cannot appear at node (1) before time
$t_{k+1}$. For a chain of N inverters, the change will not
appear at the output of $I_N$ before N time steps have
elapsed. If the time step is very small with respect to the
response time of any one inverter, this error may not be
significant.
Fig. 4. Inverter chain example.

In a Gauss-Seidel-based simulator, where node voltages
already computed at $t_{k+1}$ are made available for the
evaluation of other node voltages at $t_{k+1}$, the order of
processing elements can affect the simulator performance
substantially. In the previous example, if the inverters were
processed in the order 1, 2, 3, ..., N, then $v^1_{k+1}$ can be
determined from the input at $t_{k+1}$, $v^2_{k+1}$ from
$v^1_{k+1}$, and so on. The result will be zero accumulated
timing error. Should the nodes happen to be processed in the
reverse order, N, N-1, ..., 1, then a timing error of N time
steps will occur, the same error as in the Gauss-Jacobi-Newton
iteration.
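
The effect of processing order on timing error can be reproduced with a small experiment. The sketch below is an assumption-laden illustration: each stage is modeled as a noninverting first-order lag rather than a real inverter, and the function name `steps_until_output_moves` is hypothetical. It counts how many time steps pass before a step at the chain input becomes visible at the last node.

```python
def steps_until_output_moves(num_stages, order, use_new_values, h=0.1):
    """Count the time steps before a unit input step becomes visible at the last node.

    Each stage obeys dv_i/dt = v_(i-1) - v_i, discretized with Backward Euler.
    use_new_values=True gives a one-sweep Gauss-Seidel update, False gives Gauss-Jacobi.
    """
    v = [0.0] * num_stages
    steps = 0
    while v[-1] == 0.0:
        steps += 1
        old = list(v)
        for i in order:
            source = v if use_new_values else old
            drive = 1.0 if i == 0 else source[i - 1]   # the chain input is held at 1
            v[i] = (old[i] + h * drive) / (1.0 + h)
    return steps


N = 4
forward = list(range(N))
backward = forward[::-1]
print(steps_until_output_moves(N, forward, use_new_values=False))   # Gauss-Jacobi
print(steps_until_output_moves(N, forward, use_new_values=True))    # Gauss-Seidel, signal-flow order
print(steps_until_output_moves(N, backward, use_new_values=True))   # Gauss-Seidel, reverse order
```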

If  it  were  possible  to  order  the  processing  of  nodes  in  the 
Gauss-Seidel-Newton  iteration  so  as  to  follow  the  flow  of 
the  signal  through  the  circuit,  the  timing  error  would  be 
kept  small.  A  signal  flow  graph  would  provide  this  infor¬ 
mation.  An  example  of  a  circuit  fragment  and  associated 
signal  flow  graph,  illustrating  the  fanins  and  fanouts  of  the 
nodes,  is  shown  in  Fig.  5.  One  way  to  generate  this  graph  is 
to consider the dependency matrix introduced in Section
II. This zero-one matrix can be considered as the adjacency
matrix of a directed graph G = G(X, E), where X is the set
of vertices and E is the set of directed edges of the graph.
An edge connects node $x_i$ to node $x_j$ if $p_{ji} = 1$. By the
definition of the dependency matrix, given a circuit and its
node equations written as in (11), this graph indicates that
if an edge connects $x_i$ to $x_j$, then the voltage of node i,
$v_i$, can affect the value of the voltage of node j, $v_j$, via
the device equations. Thus the set of vertices that are
connected to $x_i$ by an edge going into $x_i$ identifies all
the nodes in the circuit that affect the value of the voltage
of node i. These are called fanin nodes of node i. Similarly,
the set of nodes that are connected to node $x_i$ by an edge
going out of $x_i$ identifies all the nodes in the circuit whose
voltage is affected by the voltage of node i. These are called
fanout nodes of node i. Note that a node can be both a fanin
and a fanout node of node i.

Fanin and fanout elements can also be defined. The
fanin elements of node i are defined as those which play
some part in determining the voltage at node i, i.e., those
elements that cause some entries of row i in the dependency
matrix to be one. For example, any MOS transistor,
modeled  by  the  simple  Shichman-Hodges  [33]  equations 
shown  in  Fig.  6,  whose  drain  or  source  is  connected  to  a 
node  would  be  classified  as  a  fanin  element  at  that  node 
since  its  drain  or  source  current  may  affect  the  node 
voltage. 

A fanout element of node i is one whose operating
conditions are directly influenced by the voltage at node i,
i.e., those elements that cause some entries of column i in
the dependency matrix to be one. For MOS transistors,
connection  to  any  of  the  three  independent  ports  (drain, 
gate,  or  source)  would  cause  that  MOS  transistor  to  be 
included  in  the  fanout-element  set  at  the  node.  It  is  there¬ 
fore  possible  for  an  element  to  appear  as  both  a  fanin  and 
a  fanout  at  the  node. 


Fig. 5. (a) Circuit fragment. (b) Associated signal flow graph.

Fig. 6. (a) n-channel MOS transistor. (b) Simple Shichman-Hodges
n-channel MOS model equations.

SPLICE1  builds  the  signal  flow  graph  by  constructing 
two  tables  for  each  node  as  the  circuit  is  read.  First,  all 
circuit  elements  are  classified  as  fanin  and/or  fanout  ele¬ 
ments  of  the  nodes  to  which  they  are  connected.  The  two 
tables  constructed  for  each  node  contain  the  names  of  the 
fanin  and  fanout  elements  at  the  node.  These  tables  are 
generated  as  the  elements  are  read  into  memory.  If  this 
graph  is  acyclic,  then  it  can  be  levelized  [12,  p.  427],  where 
a  level  number  corresponding  to  the  longest  path  (most 
branches  traversed)  from  any  independent  source  to  a  node 
is assigned to each node. If the graph does contain cycles,
special steps must be taken to break these feedback loops.
If the nodes are then processed in the order of their level
numbers, an optimal static ordering for Gauss-Seidel
processing is achieved. An ordering can also be achieved by
finding the strongly connected components of the graph [34].
The levelized graph provides a
static  ordering  of  the  network. 
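
Levelizing an acyclic signal flow graph can be sketched as a longest-path computation. The code below is a minimal illustration under stated assumptions: the fanout table is a plain dictionary, nodes that nothing drives are treated as independent sources, and the name `levelize` and the example graph are hypothetical. Breaking feedback loops is not attempted; a cycle is simply reported.

```python
from collections import deque


def levelize(fanout):
    """Assign each node the length of the longest path from any source node.

    fanout maps every node to the list of nodes it drives. Nodes that nothing
    drives are treated as independent sources (level 0).
    """
    indegree = {node: 0 for node in fanout}
    for driven in fanout.values():
        for node in driven:
            indegree[node] += 1

    level = {node: 0 for node in fanout}
    ready = deque(node for node, count in indegree.items() if count == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in fanout[node]:
            level[nxt] = max(level[nxt], level[node] + 1)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:       # all fanins of nxt have been levelized
                ready.append(nxt)
    if visited != len(fanout):
        raise ValueError("graph contains a feedback loop; break it before levelizing")
    return level


# Hypothetical signal flow graph: "in" is the independent source.
graph = {"in": ["a", "b"], "a": ["b"], "b": ["c"], "c": []}
levels = levelize(graph)
processing_order = sorted(graph, key=lambda node: levels[node])
print(levels)
print(processing_order)
```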

Whenever  the  voltage  at  a  node  changes,  it  is  possible  to 
schedule  all  of  its  fanouts  to  be  processed.  In  this  way,  the 
effect  of  a  change  at  the  input  to  a  circuit  may  be  traced  as 
it  propagates  to  other  circuit  nodes  via  the  fanout  tables, 
and  thus  via  the  circuit  elements  which  are  connected  to 
them.  Since  the  only  nodes  processed  are  those  which  are 
affected  directly  by  the  change,  this  technique  is  selective 
and  hence  its  name:  selective  trace.  If  a  selective-trace 
algorithm  is  used  with  the  fanin  and  fanout  tables,  the 
order in which the node voltages are updated becomes a
function of the signals flowing in the network, and is
therefore a dynamic ordering. This approach is often used
in modern logic simulators and is also the ordering technique
used in the SPLICE program.

Even with selective trace some timing errors can occur.
For example, wherever feedback paths exist, one time step
of error may be introduced. Consider the circuit fragment
and its associated signal flow graph shown in Fig. 5.
Assume $v_1$ and $v_2$ are such that both $M_1$ and $M_2$ are
conducting. If a large input appears at node (1) at time
$t_{k+1}$, it will be traced through nodes (1), (2), (3), and (4),
respectively. Now, however, the change in voltage at node
(4) causes node (1) to be marked to be processed again at
this time. Since only one sweep of Gauss-Seidel is being
used, the solution of node (1) a second time would be
illegal. Rather, node (1) is scheduled to be processed
one time step in the future, and thus it is possible that one
time step of timing error has been introduced.
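
The selective-trace scheduling just described, including the deferral of a node reached again through feedback, can be sketched as follows. This is an illustrative event loop and not the SPLICE scheduler: the function `selective_trace`, the callback `changed`, and the four-node loop are assumptions introduced for the example.

```python
from collections import deque


def selective_trace(fanout, changed, active):
    """Process one time point, following only the nodes reached by a change.

    fanout[n] lists the nodes driven by n; changed(n) evaluates node n and
    reports whether its voltage moved. Returns (nodes processed in order,
    nodes deferred to the next time point).
    """
    queue = deque(active)
    queued = set(active)
    processed = []
    deferred = set()
    while queue:
        node = queue.popleft()
        processed.append(node)
        if not changed(node):
            continue                     # latent node: its fanouts are not scheduled
        for nxt in fanout[node]:
            if nxt in processed:
                deferred.add(nxt)        # feedback: only one sweep per time point
            elif nxt not in queued:
                queue.append(nxt)
                queued.add(nxt)
    return processed, deferred


# Hypothetical four-node feedback loop, loosely patterned on Fig. 5.
fanout = {1: [2], 2: [3], 3: [4], 4: [1]}
processed, deferred = selective_trace(fanout, lambda node: True, active=[1])
print(processed, deferred)
```

Nodes that are never reached by a change are never touched, so latent parts of the circuit cost nothing at that time point.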

C.  Exploiting  Latency 

As  mentioned  earlier,  large  digital  circuits  are  often 
relatively  inactive.  A  number  of  schemes  can  be  used  to 
avoid  the  unnecessary  computation  involved  in  the 
reevaluation  of  the  voltage  at  nodes  which  are  inactive  or 
latent. 

A scheme used in many electrical simulators is the
“bypass” scheme, described in Section II for conventional
circuit  simulators.  This  scheme  has  also  been  employed  in  a 
number  of  timing  simulators.  However,  when  the  majority 
of  the  nodes  in  the  circuit  are  latent,  the  task  of  simply 
checking  each  node  to  determine  if  it  can  be  bypassed  can 
dominate  the  total  run  time. 

The  use  of  the  selective-trace  technique  for  dynamic 
ordering  can  provide  a  major  time  saving  here.  By  con¬ 
structing  a  list  of  all  nodes  which  are  active  at  a  time  point 
and  excluding  those  which  are  not,  selective  trace  allows 
circuit  latency  to  be  exploited  without  the  need  to  check 
each  node  for  activity.  The  elimination  of  this  checking 
process,  used  in  both  the  bypass  approach  and  the  static 
levelizing  scheme  described  previously,  can  save  a  signifi¬ 
cant  amount  of  computer  time  for  large  circuits  at  the  cost 
of  some  extra  storage  for  the  fanin  and  fanout  tables  at 
each  node. 

In  an  efficient  implementation  of  the  selective-trace  tech¬ 
nique,  the  fanin  and  fanout  tables  do  not  contain  the 
“names”  of  fanin  and  fanout  elements,  respectively,  but 
rather  a  pointer  to  the  actual  location  in  memory  where  the 
data  for  each  element  is  stored. 

D. Numerical Properties of Timing Algorithms

A  major  drawback  with  the  use  of  timing  analysis  is  that 
tightly  coupled  feedback  loops,  or  bidirectional  circuit 
elements,  can  cause  severe  inaccuracies  and  even  instability 
during the analysis. For example, if the Gauss-Seidel
“one-sweep” timing-analysis method is applied to the circuit
of Fig. 7 with the time step limited to 0.1 s, the waveforms
of Fig. 8(a) are obtained. However, if the time step is set to
0.8  s,  then  the  computed  solution  blows  up  as  shown  in 


G1 = G2 = 1 mho; C1 = C2 = 1 F; Gm1 = 1 5 mho
Fig. 7. Schematic diagram of the example circuit.

Fig. 8. (a) Accurate waveform of voltage $v_1$ computed with a 0.1-s
time step. (b) Waveform of voltage $v_1$ computed with a 0.8-s time step.

Fig. 8(b). This demonstrates that timing algorithms do not
inherit the numerical properties of the discretization formula
used to approximate the time derivative. In fact, Backward
Euler is well known to be A stable, i.e., the computed
solution of the circuit differential equations does not
“blow up,” whatever the choice of time step, as long as the
simulated circuit is stable.

The reason why this idiosyncrasy is observed is that
timing algorithms do not solve (5), since only one sweep of
the relaxation iteration is taken. Therefore the stability and
accuracy properties of the integration method used to
discretize the derivative operator no longer hold. As a
matter of fact, the combination of the discretization formula,
the various relaxation steps, and the Newton-Raphson method
forms a set of new integration algorithms. These integration
methods use an implicit formula to discretize the differential
equations, but they do not solve the nonlinear equation
obtained. Thus they are somewhat in between explicit and
implicit methods.

In the following description, the “time-advancement”
algorithms which use the Gauss-Jacobi and the Gauss-Seidel
relaxation step will be referred to as Gauss-Jacobi
and Gauss-Seidel integration algorithms, respectively. This
perspective  allows  the  understanding  of  the  numerical  be¬ 
havior  of  timing  algorithms  and  the  development  of  better 
techniques  for  timing  simulation. 

An  analysis  of  numerical  properties  of  the  Gauss-Jacobi 
and  Gauss-Seidel  integration  algorithms  when  applied  to 
MOS circuits has been carried out in [35], [36]. Only the
most  important  results  are  outlined  here.  First  consider  the 
case  where  no  floating  capacitors  are  present  in  the  circuit 
to  be  analyzed. 

The  numerical  properties  of  an  integration  method,  such 
as  stability,  are  studied  on  test  problems  [37],  [38],  which 
are  simple  enough  to  allow  a  theoretical  analysis  but  still 
sufficiently  general  that  some  insight  can  be  obtained 
about  how  the  method  will  behave  in  general.  For  the 
widely  used  linear  multistep  methods,  the  test  problem 
consists  of  a  linear  time-invariant  asymptotically  stable 
autonomous  differential  equation.  Unfortunately  this  sim¬ 
ple  test  problem  cannot  be  used  to  evaluate  relaxation-based 
time-advancement  techniques.  In  fact,  each  variable  of  the 
system  of  differential  equations  is  treated  differently 
according  to  the  ordering  in  which  equations  are  processed. 
Hence a more complex test problem is needed. The test
problem chosen here is a linear time-invariant asymptotically
stable system of autonomous differential equations, i.e.,

$$\dot{x} = Ax, \qquad x(0) = x_0 \tag{16}$$

where $A \in \mathbb{R}^{n \times n}$ and the set of eigenvalues (spectrum) of A,
$\sigma(A)$, is in the open left-half complex plane, i.e., $\sigma(A) \subset \mathbb{C}^-_0$.

In  circuit  theoretic  terms,  linear  circuits  whose  natural 
frequencies  are  in  the  open  left-half  plane  and  which 
satisfy the assumptions described in Section II are considered
as test circuits. Let A = L + D + U, where L is strictly
lower triangular, D is diagonal, and U is strictly upper
triangular. The time-advancement methods presented in
Section III applied to the test system of (16) yield the
following recursive relations:

Gauss-Jacobi integration algorithm:

$$[I - hD]x_{k+1} = [I + h(L + U)]x_k \tag{17}$$
$$x_{k+1} = M_{GJ}(h)x_k \tag{18}$$

where I is the identity matrix and

$$M_{GJ}(h) = [I - hD]^{-1}[I + h(L + U)]. \tag{19}$$

Gauss-Seidel integration algorithm:

$$[I - h(D + L)]x_{k+1} = [I + hU]x_k \tag{20}$$
$$x_{k+1} = M_{GS}(h)x_k \tag{21}$$

where

$$M_{GS}(h) = [I - h(D + L)]^{-1}[I + hU]. \tag{22}$$

The matrices $M_{GJ}(h)$ and $M_{GS}(h)$ are called the companion
matrices of the methods. If the generic companion matrix
of a method is denoted M(h), then

$$x_k = [M(h)]^k x_0. \tag{23}$$

The numerical properties of the integration algorithms
described by (23) are now examined following the outline
of one-step integration methods applied to ordinary
differential equations [37].
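
The role of the companion matrices can be made concrete with a small numerical experiment. The sketch below is an illustration only: the 2x2 test matrix is hypothetical (it is not the circuit of Fig. 7), and the helper names `mat_mul`, `mat_inv`, `spectral_radius`, and `companion` are assumptions. It compares the spectral radius of the Backward Euler, Gauss-Jacobi, and Gauss-Seidel companion matrices at a small and a large time step; a spectral radius above one means that the recursion (23) grows without bound.

```python
import cmath


def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def mat_inv(X):
    det = X[0][0] * X[1][1] - X[0][1] * X[1][0]
    return [[X[1][1] / det, -X[0][1] / det], [-X[1][0] / det, X[0][0] / det]]


def spectral_radius(M):
    """Largest eigenvalue magnitude of a 2x2 matrix, from its trace and determinant."""
    trace = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return max(abs((trace + root) / 2.0), abs((trace - root) / 2.0))


def companion(A, h, method):
    """Companion matrix of Backward Euler ("be"), Gauss-Jacobi ("gj") or Gauss-Seidel ("gs")."""
    (a, b), (c, d) = A
    if method == "be":
        implicit, explicit = [[a, b], [c, d]], [[0.0, 0.0], [0.0, 0.0]]
    elif method == "gj":
        implicit, explicit = [[a, 0.0], [0.0, d]], [[0.0, b], [c, 0.0]]      # D | L + U
    else:
        implicit, explicit = [[a, 0.0], [c, d]], [[0.0, b], [0.0, 0.0]]      # D + L | U
    left = [[(i == j) - h * implicit[i][j] for j in range(2)] for i in range(2)]
    right = [[(i == j) + h * explicit[i][j] for j in range(2)] for i in range(2)]
    return mat_mul(mat_inv(left), right)


# Hypothetical stable test matrix with eigenvalues -1 +/- 4j.
A = [[-1.0, 4.0], [-4.0, -1.0]]
for h in (0.1, 0.8):
    print(h, {m: round(spectral_radius(companion(A, h, m)), 3) for m in ("be", "gj", "gs")})
```

For this matrix both relaxation-based recursions are contractive at the small step and divergent at the large one, while Backward Euler remains contractive at both.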

The  first  numerical  property  of  the  integration  methods 
to  investigate  is  accuracy.  This  property  relates  the  error 
introduced  by  the  discretization  process  and  the  time  step. 

Definition III-D-1: Let $x(t_k)$ be the exact value of the
solution of the test problem at time $t_k$. Let $x_k$ be the
computed solution at time $t_k$ assuming $x_{k-1} = x(t_{k-1})$, i.e.,
that no error has been made in computing the value of x at
the previous time point. If $h = t_k - t_{k-1}$, the local
truncation error is defined to be

$$\epsilon = \|x(t_k) - x_k\|. \tag{24}$$

If $\epsilon = O(h^{r+1})$, r is said to be the order of the integration
method [37]. ■

It  has  been  observed  experimentally  that  if  the  time  step 
is  decreased,  the  accuracy  of  the  solution  computed  by 
timing  algorithms  improves  in  almost  the  same  way  as 
Backward  Euler.  In  fact,  the  Gauss-Jacobi  and  the 
Gauss-Seidel  integration  algorithms  have  the  same  accu¬ 
racy  as  the  Backward  Euler  integration  method. 

Theorem III-D-2: Gauss-Jacobi and Gauss-Seidel integration
algorithms are first-order integration algorithms. ■
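
The first-order property can be checked numerically: for a method of order r the local truncation error behaves as $O(h^{r+1})$, so halving the time step should divide the error of a first-order method by about four. The sketch below is an illustration under stated assumptions: the test matrix A = [[-1, 4], [-4, -1]] is hypothetical and was chosen because its exact solution is available in closed form, and the function names are introduced only for this example.

```python
import math


def exact_step(x, h):
    """Exact solution of dx/dt = A x over one step for A = [[-1, 4], [-4, -1]]."""
    decay, c, s = math.exp(-h), math.cos(4.0 * h), math.sin(4.0 * h)
    return [decay * (c * x[0] + s * x[1]), decay * (-s * x[0] + c * x[1])]


def gauss_jacobi_step(x, h):
    # Each row is solved for its own unknown; the other unknown is taken at t_k.
    return [(x[0] + 4.0 * h * x[1]) / (1.0 + h),
            (x[1] - 4.0 * h * x[0]) / (1.0 + h)]


def gauss_seidel_step(x, h):
    # Row 0 uses the old x[1]; row 1 uses the freshly computed first component.
    first = (x[0] + 4.0 * h * x[1]) / (1.0 + h)
    second = (x[1] - 4.0 * h * first) / (1.0 + h)
    return [first, second]


def local_error(step, h, x0=(1.0, 1.0)):
    exact, computed = exact_step(x0, h), step(x0, h)
    return math.hypot(exact[0] - computed[0], exact[1] - computed[1])


for step in (gauss_jacobi_step, gauss_seidel_step):
    ratio = local_error(step, 0.01) / local_error(step, 0.005)
    print(step.__name__, ratio)
```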

In circuit analysis, another important criterion for
evaluating the accuracy of an integration method can be
defined as waveform accuracy. In general, the computed
solution of a system of differential equations is the
superposition of a principal solution and associated parasitic
solutions. Parasitic solutions are generated by the numerical
approximations of the integration methods. In particular,
an nth-order integration algorithm yields n - 1 parasitic
solutions when applied to the test problem. For the algorithms
under consideration in this paper, the displacement techniques
introduce additional spurious components called numerical
solution components.

If  the  original  system  to  be  analyzed  does  not  contain  an 
oscillatory  component,  the  presence  of  such  a  component 
in  the  computed  solution  can  be  misleading  in  the  evalua¬ 
tion  of  the  performances  of  the  system.  As  was  shown  in 
the  previous  subsection,  the  Gauss-Seidel  integration  algo¬ 
rithm  introduces  spurious  oscillations  in  the  computed 
solution  of  the  equations  describing  a  circuit  where  the 
exact  solution  does  not  have  any  oscillations.  It  is  neces¬ 
sary  to  introduce  methods  that  allow  the  evaluation  of  the 
“waveform accuracy” of the integration methods. To this
end, a subclass of the test problem is now introduced,
characterized by $\sigma(A) \subset \mathbb{R}^-_0$, i.e., the set of test problems
which do not have an oscillatory component in the
solution, and bounds on the oscillatory components of the
computed solutions must be established.

Theorem  III-D-2  provides  a  bound  on  the  oscillatory 
components of all the methods. In particular, it is clear
that, by choosing an appropriately small step size h, the
numerical  solution  oscillatory  components  can  be  made 
negligible  with  respect  to  the  principal  solution. 

Another fundamental property of integration methods is
stability.  Stability  can  also  be  defined  in  terms  of  the  test 
problem. 

Definition III-D-3 (Stability): An integration algorithm is
stable if $\exists \delta > 0$, $\exists N > 0$ such that $\forall x_0 \in \mathbb{R}^n$, $\exists \bar{k} > 0$ and

$$\|x_k\| < N, \quad \forall k > \bar{k}, \quad \forall h \in [0, \delta) \tag{25}$$

where $\{x_k\}$ is the sequence generated by the algorithm applied
to the test problem according to (23). ■

It  is  obvious  that  a  numerical  method  that  is  not  stable  is 
of  no  practical  use.  It  happens  that  both  relaxation-based 
integration  algorithms  are  stable  as  stated  in  the  following 
theorem, which is proven in [35].

Theorem  III-D-4:  Gauss-Jacobi  and  Gauss-Seidel  in¬ 
tegration  algorithms  are  stable.  ■ 

The  accuracy  and  the  stability  of  the  integration  algo¬ 
rithms  explain  the  success  of  timing  simulators,  but  they 
also  point  out  what  the  problems  are  with  their  use.  In 
particular, note that $\delta$ of Definition III-D-3 can be quite
small,  i.e.,  that  the  time  step  may  have  to  be  reduced  not 
for  accuracy  reasons,  but  to  make  sure  that  the  computed 
solution  does  not  blow  up.  Moreover,  it  is  difficult  to 
identify  oscillations  in  the  computed  solutions  as  spurious. 

To cope at least in part with these problems, another
displacement technique for the solution of (1) has been
proposed for a simple circuit in [42]. This algorithm is a
symmetric-displacement method reminiscent of the
alternating-direction implicit method [32] and is based on a
class of methods proposed by Kahan [39]. The basic idea
here is to “symmetrize” the Gauss-Seidel scheme with a
method that takes two half-steps of size h/2 each: one
half-step is taken in the usual “forward” (i.e., lower
triangular) direction, and the second half-step is taken in the
backward (i.e., upper triangular) direction.

This method is introduced with the help of the linear
system described by (16). The first half-step corresponds to
the Gauss-Seidel method. Let A = L + D + U; then

$$\left[I - \frac{h}{2}\left(L + \frac{D}{2}\right)\right] x_{k+1/2} = \left[I + \frac{h}{2}\left(\frac{D}{2} + U\right)\right] x_k. \tag{26}$$

Note that there is a difference between (20) and (26) since
D has been split into two parts here. This splitting of D is
necessary to “symmetrize” the method. The backward
half-step is then

$$\left[I - \frac{h}{2}\left(\frac{D}{2} + U\right)\right] x_{k+1} = \left[I + \frac{h}{2}\left(L + \frac{D}{2}\right)\right] x_{k+1/2}. \tag{27}$$

Consider the simple example of Fig. 9. The first half-step


yields

[Equations for the first half-step of the circuit of Fig. 9 are illegible in the source.]

Fig. 9. Simple circuit to illustrate the modified symmetric Gauss-Seidel
method.

The backward step involves the solution of

[Equations for the backward step of the circuit of Fig. 9 are illegible in the source.]

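If these formulas are generalized for the nonlinear case,
and with

$$v_{l,i} = [v^1_l, \ldots, v^i_l, v^{i+1}_{l-1/2}, \ldots, v^N_{l-1/2}]^T \quad \text{if } 2l \text{ is odd} \tag{28}$$
$$v_{l,i} = [v^1_{l-1/2}, \ldots, v^{i-1}_{l-1/2}, v^i_l, \ldots, v^N_l]^T \quad \text{if } 2l \text{ is even,} \tag{29}$$

the forward step yields

$$v^i_{k+1/2} - v^i_k + \frac{h}{4} F_i(v_{k+1/2,i}, u(t_{k+1/2})) + \frac{h}{4} F_i(v_{k+1/2,i-1}, u(t_{k+1/2})) = 0, \quad i = 1, 2, \ldots, N \tag{30}$$

and the backward step yields

$$v^i_{k+1} - v^i_{k+1/2} + \frac{h}{4} F_i(v_{k+1,i}, u(t_{k+1})) + \frac{h}{4} F_i(v_{k+1,i+1}, u(t_{k+1})) = 0, \quad i = N, N-1, \ldots, 1. \tag{31}$$

The solution of the decoupled equations is then approximated
by taking one step of the Newton-Raphson algorithm.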

This  method  can  be  proven  to  be  more  accurate  and 
more  stable  than  the  previous  one.  The  additional  work 
required  to  perform  the  intermediate  step  is  compensated 
by  the  additional  accuracy  as  specified  by  the  following 
theorem. 
Theorem III-D-5: The modified symmetric Gauss-Seidel
algorithm is a second-order integration algorithm. ■
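
The second-order claim can be checked in the same way as for a first-order method: a local truncation error of $O(h^3)$ should fall by about a factor of eight when the time step is halved. The sketch below is an illustration under explicit assumptions: the forward half-step is taken to be $[I - \frac{h}{2}(L + \frac{D}{2})]x_{k+1/2} = [I + \frac{h}{2}(\frac{D}{2} + U)]x_k$ and the backward half-step its mirror image, which is the linear form implied by (30) and (31); the test matrix and all function names are hypothetical.

```python
import math

A = [[-1.0, 4.0], [-4.0, -1.0]]   # hypothetical stable test matrix


def exact_step(x, h):
    """Exact solution of dx/dt = A x over one step for the matrix A above."""
    decay, c, s = math.exp(-h), math.cos(4.0 * h), math.sin(4.0 * h)
    return [decay * (c * x[0] + s * x[1]), decay * (-s * x[0] + c * x[1])]


def symmetric_gs_step(x, h):
    """One full step of the symmetrized Gauss-Seidel scheme for a 2x2 system."""
    (a, b), (c, d) = A
    g = h / 2.0
    # Forward half-step: [I - g(L + D/2)] y = [I + g(D/2 + U)] x
    y0 = (x[0] + g * (0.5 * a * x[0] + b * x[1])) / (1.0 - 0.5 * g * a)
    y1 = (x[1] + g * 0.5 * d * x[1] + g * c * y0) / (1.0 - 0.5 * g * d)
    # Backward half-step: [I - g(D/2 + U)] z = [I + g(L + D/2)] y
    z1 = (y1 + g * (c * y0 + 0.5 * d * y1)) / (1.0 - 0.5 * g * d)
    z0 = (y0 + g * 0.5 * a * y0 + g * b * z1) / (1.0 - 0.5 * g * a)
    return [z0, z1]


def local_error(h, x0=(1.0, 1.0)):
    exact, computed = exact_step(x0, h), symmetric_gs_step(x0, h)
    return math.hypot(exact[0] - computed[0], exact[1] - computed[1])


ratio = local_error(0.004) / local_error(0.002)
print(ratio)
```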

In addition, the “waveform accuracy” of this method is
better than that of the other integration algorithms. If the
class of the test problems is restricted to the subclass
characterized by a symmetric A matrix, then a strong result
for the modified symmetric Gauss-Seidel integration
method can be obtained. In circuit theoretic terms, this
analysis applies to linear circuits whose node equations
yield a symmetric nodal admittance matrix when only the
resistive part of the circuit is considered. Moreover, it is
required that this matrix remain symmetric when premultiplied
by $C^{-1}$, where C is the diagonal matrix of the grounded
capacitors. A sufficient condition for this to occur is that
the circuit consist of two-terminal linear resistors and
capacitors and that the grounded capacitors be of equal value.
The case where the capacitors are not of equal value can
also be included in this class provided that a scaling of the
rows of the matrix is performed.

Theorem III-D-6: If A is a real, symmetric matrix, the
spectrum of the companion matrix of the modified symmetric
Gauss-Seidel integration method is real, i.e., no
oscillatory parasitic components are present in the computed
solution. ■

This  theorem  guarantees  that  the  time  step  does  not  have 
to  be  limited  to  eliminate  spurious  oscillations  at  least  for 
reciprocal circuits. Numerical results obtained with an
experimental timing simulator show that spurious oscillations
do not appear in the solution of nonlinear nonreciprocal
circuits either [36], [40].

For computational efficiency, it is desirable that the step
size be limited only by accuracy considerations as in the
case of the implicit backward differentiation formulas [37].
In the case of classical multistep methods, the concepts of
A stability [38] and stiff stability [37] have been introduced
to test the “unconditional” stability of multistep methods.
For the “time-advancement” techniques presented in this
paper, it makes sense to define a similar concept.
Unfortunately, general results of “unconditional” stability are
not available for the test problem defined previously, but
only for a subclass; once more the subclass characterized
by a symmetric A matrix.

Definition III-D-1 (A stability): An integration method is
A stable if $\exists N > 0$ such that $\forall x_0 \in R^n$, $\exists \hat{k}$:

$$\|x_k\| < N, \qquad \forall k > \hat{k}, \qquad \forall h \in [0, \infty) \qquad (32)$$

where $\{x_k\}$ is the sequence generated by the method
applied to the test problem of (11) with $A$ symmetric. ■

Theorem III-D-8: The modified symmetric Gauss-Seidel
method is A stable. ■

Note that no A stability result for the Gauss-Jacobi and
the Gauss-Seidel integration methods has been proven. In
our practical experiments, we have seen that when applied
to real circuit problems the modified symmetric Gauss-Seidel
method is indeed "more stable" than the other two
methods.


Fig. 10. (a) Bootstrap inverter circuit and (b) effect of $C_{gd}$ feedthrough.

E.  Floating  Capacitors 

As  mentioned  in  the  introduction  to  this  section,  a 
circuit  element  that  has  limited  the  application  of  timing 
analysis  is  the  floating  capacitor  [41],  [42]. 

The floating capacitor is often an important element in
the design of integrated circuits. In Fig. 10(a) the value
of the bootstrap capacitor $C_b$ is generally large compared
with the values of the associated parasitic grounded capacitors
$C_1$ and $C_2$. The value of the intrinsic gate-drain
feedthrough capacitance $C_{gd}$ in Fig. 10(b) is often small
compared with other circuit parasitics at the gate and drain
nodes; however, the effect of $C_{gd}$ on circuit performance
can be significant due to the large voltage gain of the stage.

When  floating  capacitors  are  present  in  the  circuit  to  be 
analyzed,  the  timing  simulation  algorithms  presented  in  the 
previous  sections  take  a  different  form.  For  the  sake  of 
simplicity,  consider  a  linear  time-invariant  circuit  described 
by 

$$C\dot{v} = -Gv, \qquad v(0) = V \qquad (33)$$

where  C  is  the  node  capacitance  matrix  and  G  is  the  node 
conductance  matrix.  If  C  is  inverted,  the  methods  de¬ 
scribed  previously  apply.  However,  inverting  C  is  expensive 
and most of the advantages of timing simulation algorithms
would be lost. Thus if the Backward Euler formula
is used to discretize the circuit equations at time $t_{k+1}$, $v_{k+1}$
is given by

$$C(v_{k+1} - v_k) = -hGv_{k+1} \qquad (34)$$

or, rearranging (34), $v_{k+1}$ is given by

$$(C + hG)v_{k+1} = Cv_k. \qquad (35)$$

Let $C$ be split as $C_d + C_l + C_u$ and $G$ as $G_d + G_l + G_u$,
where $C_d$ and $G_d$ are diagonal matrices, $C_l$ and $G_l$ are
strictly lower triangular matrices, and $C_u$ and $G_u$ are strictly
upper triangular matrices. Then, the time-advancement
Gauss-Jacobi algorithm for circuits described by (33) becomes

$$(C_d + hG_d)v_{k+1} = (C - C_l - C_u - hG_l - hG_u)v_k = (C_d - h(G_l + G_u))v_k \qquad (36)$$

and the Gauss-Seidel time-advancement algorithm is

$$(C_d + C_l + h(G_d + G_l))v_{k+1} = (C - C_u - hG_u)v_k = (C_d + C_l - hG_u)v_k \qquad (37)$$

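and finally the modified symmetric Gauss-Seidel algorithm
is

$$\left(C_d + C_l + \tfrac{h}{2}(G_d + G_l)\right)v_{k+1/2} = \left(C_d + C_l - \tfrac{h}{2}G_u\right)v_k \qquad (38a)$$

$$\left(C_d + C_u + \tfrac{h}{2}(G_d + G_u)\right)v_{k+1} = \left(C_d + C_u - \tfrac{h}{2}G_l\right)v_{k+1/2}. \qquad (38b)$$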

When  these  algorithms  are  applied  to  circuits  where  C  is 
not  diagonal,  serious  stability  and  accuracy  problems  may 
arise.  In  the  example  of  Fig.  7,  the  Gauss-Seidel  integra¬ 
tion  algorithm  with  a  time  step  equal  to  0.6  s  computes  the 
solution  shown  in  Fig.  11,  with  oscillations  that  are  not 
present  in  the  accurate  solution  of  the  circuit  equations 
shown  in  Fig.  8(a).  In  addition,  these  methods  are  not  even 
consistent.  That  is,  when  the  time  step  h  is  reduced  to  0, 
the  sequence  of  voltages  computed  according  to  (36)  or  to 
(37)  does  not  converge  to  the  solution  of  (33).  In  fact,  since 
(36)  is  the  same  as  the  equation  obtained  by  applying  the 
Gauss-Jacobi  algorithm  to  a  circuit  where  only  grounded 
capacitors  are  present,  the  effect  of  the  floating  capacitors 
is completely neglected (note that in (36), $C_l$ and $C_u$ do not
appear). The Gauss-Seidel algorithm neglects $C_u$ only, and
hence is more accurate than the Gauss-Jacobi algorithm.
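
The loss of consistency can be checked numerically. The following minimal Python sketch is not taken from the original work: it applies one Gauss-Jacobi sweep per time point, as in (36), to a hypothetical two-node circuit (grounded capacitors of 1, a floating capacitor of 2, conductances of 1 and 2 to ground) and compares the result with the Backward Euler formula (35) solved exactly. All element values and function names are illustrative assumptions.

```python
import math

def backward_euler(C, G, v0, h, steps):
    """Reference: (C + hG) v_{k+1} = C v_k, solved exactly (2x2, Cramer's rule)."""
    a = [[C[i][j] + h * G[i][j] for j in range(2)] for i in range(2)]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    v = list(v0)
    for _ in range(steps):
        b = [C[0][0] * v[0] + C[0][1] * v[1], C[1][0] * v[0] + C[1][1] * v[1]]
        v = [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
             (a[0][0] * b[1] - a[1][0] * b[0]) / det]
    return v

def gauss_jacobi_timing(C, G, v0, h, steps):
    """One Jacobi sweep per time point, as in (36): the off-diagonal
    (floating) entries of C never enter the update."""
    v = list(v0)
    for _ in range(steps):
        v = [(C[i][i] * v[i] - h * G[i][1 - i] * v[1 - i]) / (C[i][i] + h * G[i][i])
             for i in range(2)]
    return v

# Hypothetical circuit: c1 = c2 = 1 (grounded), cf = 2 (floating), g1 = 1, g2 = 2.
C = [[3.0, -2.0], [-2.0, 3.0]]
G = [[1.0, 0.0], [0.0, 2.0]]
v0 = [1.0, 0.0]
T, steps = 1.0, 20000
h = T / steps

reference = backward_euler(C, G, v0, h, steps)
jacobi = gauss_jacobi_timing(C, G, v0, h, steps)
# Node 1 response when the coupling through the floating capacitor is ignored.
decoupled_v1 = math.exp(-G[0][0] * T / C[0][0])
print("Backward Euler :", reference)
print("Gauss-Jacobi   :", jacobi)
print("decoupled v1(T):", decoupled_v1)
```

With these values the Gauss-Jacobi result settles on the response of the circuit with the coupling through the floating capacitor removed, however small the step is made, while the reference solution shows node 2 being pulled negative through the floating capacitor.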

The  modified  symmetric  Gauss-Seidel  integration  algo¬ 
rithm  presented  in  the  previous  subsection  has  been  proven 
to  be  accurate,  stable,  and  even  A  stable,  but  only  for  the 
particular  classes  of  circuits  characterized  by  G  and  C 
matrices  with  appropriate  mathematical  properties  [36]. 
However,  the  modified  symmetric  Gauss-Seidel  algorithm 
is not consistent in the general case either. On the other
hand, since $C_u$ is neglected in the first half-step while $C_l$ is
neglected in the second half-step, the effect of the floating
capacitors  on  the  solution  of  (33)  is  modeled  more  pre¬ 
cisely  than  in  the  other  methods,  and  better  accuracy  is 
obtained  as  a  consequence  [36]. 

Early  timing  simulators  avoided  the  problem  of  analyz¬ 
ing  floating  capacitors  by  not  allowing  the  user  to  include 
them  in  the  circuit  description.  The  effect  of  a  floating 
capacitor  may  then  be  approximated  by  altering  the  values 
of  the  grounded  capacitors  at  appropriate  nodes  in  the 
circuit.  If  the  operation  of  a  circuit  depends  on  a  floating 
capacitor, a functional macromodel may be used (e.g.,
[9]-[11]) where the effect of the floating element is hidden
from  the  relaxation  iteration  by  special  processing  of  the 
circuit  fragment  in  which  it  is  embedded,  perhaps  involving 
local  matrix  solution  [29]. 

Another approach called the Implicit-Implicit-Explicit
(IIE) method has been proposed [41] for circuits with
floating capacitors. This method can be generalized and
explained using (33). Let (33) be rearranged as

$$(C_d + C_l)\dot{v}' + C_u\dot{v}'' = -Gv, \qquad v(0) = V \qquad (39)$$

where $\dot{v}' = \dot{v}'' = \dot{v}$. The Backward Euler method is used to
approximate $\dot{v}'$ and Forward Euler is used to approximate
$\dot{v}''$. Then (39) becomes

$$(C_d + C_l)(v_{k+1} - v_k) + C_u(v_k - v_{k-1}) = -hGv_{k+1}. \qquad (40)$$

Applying the Gauss-Seidel integration scheme to (40) and
rearranging terms, the IIE equations for the linear circuit
described by (33) become

$$(C_d + C_l + h(G_d + G_l))v_{k+1} = (C_d + C_l)v_k - C_u(v_k - v_{k-1}) - hG_u v_k. \qquad (41)$$

Note that the effects of both the lower triangular part and
the upper triangular part of the capacitance matrix are now
taken into account. However, to date the numerical properties
of this scheme have only been published for a simple
two-node linear test circuit [41], [42]. The method has been
difficult to characterize rigorously since it involves information
from two previous time points ($t_k$ and $t_{k-1}$). Recently
the stability, consistency, and order of accuracy of
this approach have been determined for the general case
[43] and the method looks very promising. The IIE method
is used in the SPLICE1 program for the analysis of floating
capacitors [12].
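
As an illustration only, the update (41) can be sketched in Python as a forward substitution over the nodes. The two-node circuit, its element values, the start-up choice $v_{-1} = v_0$, and all function names below are assumptions made for this sketch and are not taken from SPLICE1. Passing the current voltages in place of $v_{k-1}$ makes the $C_u$ term vanish, which reproduces the Gauss-Seidel update (37) for comparison.

```python
def backward_euler(C, G, v0, h, steps):
    """Reference: (C + hG) v_{k+1} = C v_k, solved exactly (2x2, Cramer's rule)."""
    a = [[C[i][j] + h * G[i][j] for j in range(2)] for i in range(2)]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    v = list(v0)
    for _ in range(steps):
        b = [C[0][0] * v[0] + C[0][1] * v[1], C[1][0] * v[0] + C[1][1] * v[1]]
        v = [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
             (a[0][0] * b[1] - a[1][0] * b[0]) / det]
    return v

def iie_step(C, G, v, v_prev, h):
    """One step of (41), solved row by row (the left-hand side is lower triangular)."""
    n = len(v)
    new = [0.0] * n
    for i in range(n):
        rhs = C[i][i] * v[i]
        for j in range(i):            # strictly lower part: implicit, uses new values
            rhs += C[i][j] * v[j] - (C[i][j] + h * G[i][j]) * new[j]
        for j in range(i + 1, n):     # strictly upper part: explicit, uses v_k and v_{k-1}
            rhs -= C[i][j] * (v[j] - v_prev[j]) + h * G[i][j] * v[j]
        new[i] = rhs / (C[i][i] + h * G[i][i])
    return new

def integrate(C, G, v0, h, steps, use_iie):
    v_prev = list(v0)                 # assumption: v_{-1} = v_0 starts the recursion
    v = list(v0)
    for _ in range(steps):
        history = v_prev if use_iie else v   # history = v drops the C_u term, giving (37)
        v, v_prev = iie_step(C, G, v, history, h), v
    return v

# Hypothetical circuit: c1 = c2 = 1 (grounded), cf = 2 (floating), g1 = 1, g2 = 2.
C = [[3.0, -2.0], [-2.0, 3.0]]
G = [[1.0, 0.0], [0.0, 2.0]]
v0 = [1.0, 0.0]
T, steps = 1.0, 2000
h = T / steps

reference = backward_euler(C, G, v0, h, steps)
iie = integrate(C, G, v0, h, steps, use_iie=True)
gauss_seidel = integrate(C, G, v0, h, steps, use_iie=False)
print("Backward Euler:", reference)
print("IIE (41)      :", iie)
print("Gauss-Seidel  :", gauss_seidel)
```

For this example the IIE result tracks the Backward Euler reference as the step is reduced, whereas the Gauss-Seidel result does not, because the upper triangular part of the capacitance matrix is missing from its update.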

F.  Conclusions 

Timing  simulation  algorithms  are  fast  and  rather  accu¬ 
rate  for  the  electrical  simulation  of  MOS  circuits  with  no 
tight feedback loops. However, several stability and accuracy
problems hamper the use of timing simulation as a
standard  simulation  tool.  In  particular,  major  drawbacks  of 
timing  simulation  algorithms  are 

1) The selection of an appropriate step size is difficult. If
a fixed step size is used, heuristics must be introduced to
estimate the time constants of the circuit [10]. If a variable
step size is used, the local truncation error must be estimated.
In fact, the other technique used to control step size
in standard circuit simulators, the so-called iteration count
method [1], cannot be applied here since the relaxation
techniques are not carried to convergence. Unfortunately,
the  local  truncation  error  cannot  be  estimated  accurately 
since  the  error  in  the  voltage  computed  at  time  tk  by  the 
timing  simulation  algorithms  is  the  sum  of  the  truncation 
error  due  to  the  integration  method  and  of  the  error  due  to 
the  inaccurate  solution  of  the  discretized  nonlinear  equa¬ 
tions.  These  two  errors  can  be  of  the  same  order  and 
sometimes  the  latter  component  can  even  be  larger  than 
the  truncation  error. 

2)  The  step  size  may  be  limited  by  stability  considera¬ 
tions  since  timing  simulation  algorithms  are  A  stable  only 
in  particular  cases.  This  may  force  the  use  of  small  step 
sizes even though large step sizes may be possible from
an accuracy point of view.

3) With the exception of the IIE method, timing simulation
algorithms are not even consistent when applied to the
analysis  of  circuits  containing  floating  capacitors.  This 
means  that  their  accuracy  cannot  be  improved  over  a 
certain  limit  by  further  step-size  reductions. 

All of these problems stem from the fact that the relaxation
methods are not carried to convergence. The fear that
carrying  the  relaxation  iteration  to  convergence  would  re¬ 
duce  the  speed  advantages  of  timing  simulation  prevented 
the  adoption  of  the  obvious  remedy  to  this  situation  for  a 
number  of  years.  In  the  following  sections,  techniques  and 
simulators  based  on  convergent  relaxation  methods  are 
introduced  and  shown  to  be  highly  competitive  with  stan¬ 
dard  circuit  simulators  for  accuracy  and  with  timing  simu¬ 
lators  for  speed. 

IV.  Iterated  Timing  Analysis 

A.  Introduction 

Iterated Timing Analysis (ITA) [19] is a new form of
electrical analysis which can be derived from timing analysis.
This form of relaxation-based electrical analysis has
shown promising results over a wide class of circuits, from
large digital circuits to complex analog designs. The technique
is accurate, fast for large digital circuits, and amenable
to implementation on advanced computer architectures,
such as vector and array processors [44]-[47] as well as
data-flow machines [48], [49].

The  starting  point  for  a  description  of  ITA  is  the  circuit 
equation  formulation  of  (2).  The  differential  equations  are 
converted  to  a  set  of  nonlinear,  algebraic  difference  equa¬ 
tions  (3)  using  a  stiffly  stable  integration  formula,  and  an 
iterative  relaxation  method  (Gauss-Jacobi  or  Gauss- 
Seidel)  is  then  used  to  solve  them.  However,  unlike  timing 
analysis  where  a  single  relaxation  iteration  is  used  per  time 
point,  in  the  ITA  approach  the  relaxation  process  is  con¬ 
tinued  to  convergence  at  a  time  point. 

Only one Newton-Raphson iteration is used to approximate
the solution of each nodal equation per relaxation
iteration and event-driven selective trace techniques
may still be used to exploit latency, as for timing simulation.
Thus the mathematical framework of ITA is the
nonlinear Gauss-Seidel (Gauss-Jacobi)-Newton method
presented in Section II.

B. Numerical Properties of ITA

Since  in  ITA  the  nonlinear  circuit  equations  are  solved 
by  an  iterative  method  until  satisfactory  convergence  is 
achieved,  the  numerical  properties  of  the  integration  meth¬ 
ods  used  to  discretize  the  circuit  equations  are  retained. 
Thus  the  stability  and  the  accuracy  problems  typical  of  the 
timing  simulation  algorithms  presented  in  Section  III  are 
not  an  issue.  However,  the  basic  question  is  whether  the 
relaxation  iteration  will  converge  at  each  time  point  when 
solving  the  discretized  circuit  equations. 

The  conditions  under  which  the  relaxation  iteration  is 
guaranteed to converge were presented in Section II. Note
that these conditions require the diagonal dominance of
the Jacobian of the discretized nonlinear equations. Returning
once again to (2), the circuit equations may be
formulated as

$$C(v,u)\dot{v} + f(v,u) = 0, \qquad v(0) = V \qquad (42)$$

where $C: R^n \times R^r \to R^{n \times n}$ is a symmetric diagonally dominant
matrix-valued function in which $-C_{ij}(v,u)$, $i \ne j$, is
the total floating capacitance between nodes $i$ and $j$,
$C_{ii}(v,u)$ is the sum of the capacitances of all capacitors
connected to node $i$, and $f: R^n \times R^r \to R^n$ is a continuous
function, each component of which represents the
net current charging the capacitor at a node due to other
conductive elements. If the capacitance matrix $C(v,u)$ is
assumed  to  be  symmetric  and  positive  definite  (and  hence 
strictly  diagonally  dominant),  as  is  the  case  if  all  the 
capacitors  in  the  circuit  are  two-terminal  elements  and  are 
positive  for  all  values  of  u,  it  is  intuitive  to  see  that  the 
Jacobian  matrix  of  the  discretized  nonlinear  circuit  equa¬ 
tions  is  diagonally  dominant  provided  that  the  time  step  is 
small  enough.  In  fact,  the  time  step  is  acting  as  a  scaling 
parameter  that  increases  the  role  of  the  capacitance  matrix 
in  the  Jacobian  matrix  when  it  is  decreased.  More  formally, 
the  convergence  properties  of  ITA  can  be  proven  rather 
easily for circuit equations of the form of (42) where
$C(v,u)$ is a matrix of real numbers, i.e., the capacitors
present in the circuit are all linear. Then the discretized
equations become

$$C(v_{k+1} - v_k) + h_{k+1} f(v_{k+1}, u_{k+1}) = 0 \qquad (43)$$

where $h_{k+1}$ is the time step selected at time $t_k$. The
following strong theorem has been proven in [50].

Theorem IV-B-1: There exists a strictly positive time step $h$
such that for all $h_{k+1} \le h$ the nonlinear Gauss-Jacobi
and the nonlinear Gauss-Seidel iteration applied to
(43) converge to the solution of the discretized circuit
equations independent of the initial guess. ■
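
The role of the time step can be illustrated with a small experiment, sketched here in Python under stated assumptions: the capacitors are unit grounded capacitors, $f(v) = Gv$ is linear, and the hypothetical matrix $G$ has large off-diagonal entries, so that the Jacobian of $f$ is not diagonally dominant. For this particular example the Gauss-Jacobi iteration matrix has spectral radius $4/(1/h + 1)$, so the relaxation converges only for $h < 1/3$. The function name and all values are illustrative.

```python
def jacobi_relaxation(c, G, v_old, h, tol=1e-10, max_sweeps=200):
    """Gauss-Jacobi relaxation on (c_i / h)(v_i - v_old_i) + sum_j G_ij v_j = 0.

    Returns (v, sweeps) on convergence, or None if the iteration diverges
    or fails to converge within max_sweeps.
    """
    n = len(v_old)
    v = list(v_old)                       # initial guess: previous time point
    for sweep in range(1, max_sweeps + 1):
        new = []
        for i in range(n):
            coupling = sum(G[i][j] * v[j] for j in range(n) if j != i)
            new.append((c[i] / h * v_old[i] - coupling) / (c[i] / h + G[i][i]))
        change = max(abs(a - b) for a, b in zip(new, v))
        v = new
        if change <= tol:
            return v, sweep
        if change > 1e12:                 # clearly diverging
            return None
    return None

# Hypothetical example with strong coupling between the two nodes.
c = [1.0, 1.0]
G = [[1.0, 4.0], [-4.0, 1.0]]
v_old = [1.0, 0.0]

small_step = jacobi_relaxation(c, G, v_old, h=0.1)
large_step = jacobi_relaxation(c, G, v_old, h=1.0)
print("h = 0.1:", small_step)
print("h = 1.0:", large_step)
```

Reducing $h$ increases the weight of the capacitance terms on the diagonal of the Jacobian of the discretized equations, which is what restores convergence in the small-step case.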

C.  Implementation  of  ITA 

In Theorem IV-B-1, the value of $h$ can be quite small if
the Jacobian of $f$ is not diagonally dominant at the time
point of interest and if the $C$ matrix has large off-diagonal


NEWTON AND SANGIOVANNI-VINCENTELLI: RELAXATION-BASED ELECTRICAL SIMULATION


elements, i.e., when large floating capacitors and/or tight
feedback loops are present in the circuit. Hence, such an
iterative method would not appear well suited to the analysis
of circuits with strong bilateral coupling. However, an
ITA capability has been implemented in the SPLICE1
program [19], [51], and while strong bilateral coupling does
increase simulation time, the correct solution is obtained
even for analog circuits. With the event-driven selective
trace scheduling as implemented in SPLICE1, less than a
factor of two increase in CPU time has been observed
compared with SPLICE1 timing simulation, for large digital
circuits. For small tightly coupled MOS analog circuits,
the ITA program may take even longer than SPICE2, as
illustrated in the next section. However, for large integrated
circuits such tight coupling is local to a small
block and the advantages obtained from circuit latency, as
well as the ability to exploit parallel processing effectively,
far outweigh the disadvantages.

The following algorithm illustrates the principal steps
involved in ITA analysis for use on a conventional computer.
Only the Gauss-Seidel form is shown here, but the
Gauss-Jacobi form can be obtained as outlined in Section
II. At each time at which one or more nodes are scheduled
to be processed, two event lists, $E_A(t_n)$ and $E_B(t_n)$, are
used to separate the nodes to be processed in successive
iterations, $k$ and $k+1$, of the Gauss-Seidel-Newton process.

Gauss-Seidel Iteration:

put all nodes that are connected to independent sources
    in event list E_A(0);
t_n ← 0;
while (t_n < TSTOP) {
    k ← 0;
    while (event list E_A(t_n) is not empty) {
        foreach (i in E_A(t_n)) {
            obtain v_i^{k+1} from
                g_i(v_1^{k+1}, ..., v_i^{k+1}, v_{i+1}^k, ..., v_n^k) = 0
                using a single Newton-Raphson step;
            if (|v_i^{k+1} - v_i^k| ≤ ε, i.e., convergence is achieved) {
                use LTE to determine the next time, t_s, for
                    processing node i;
                add node i to event list E_A(t_s);
            }
            else {
                add node i to event list E_B(t_n);
                add the fanout nodes of node i to event list E_A(t_n)
                    if they are not already on E_A(t_n);
            }
        }
        E_A(t_n) ← E_B(t_n); E_B(t_n) ← empty;
        k ← k + 1;
    }
    t_n ← t_{n+1};
}

where $t_n$ is the present time for processing and $t_{n+1}$ is the
next time in the time queue at which an event was scheduled.
In this way, the "time step" is handled independently
for each node.

This  simplified  algorithm  does  not  illustrate  how  such 
issues  as  time-step  reduction  and  local  truncation-error 
estimation  are  handled.  These  and  other  important  details 
of  the  algorithm  are  described  elsewhere  [53]. 
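
The numerical core of the procedure, stripped of event scheduling and step-size control, can be sketched as follows. This Python fragment is an illustrative assumption rather than the SPLICE1 implementation: it uses a fixed Backward Euler step, sweeps over all nodes instead of maintaining event lists, and analyzes a hypothetical two-node circuit with cubic nonlinear conductances. What it retains from ITA is that each relaxation sweep takes a single Newton-Raphson step per nodal equation and that the sweeps are repeated until convergence at every time point.

```python
def ita(f, dfdv_diag, c, v0, h, steps, eps=1e-9, max_sweeps=200):
    """Backward Euler + Gauss-Seidel-Newton relaxation, iterated to
    convergence at every time point (no event lists, fixed time step)."""
    v = list(v0)
    sweeps_per_point = []
    for _ in range(steps):
        v_old = list(v)
        for sweep in range(1, max_sweeps + 1):
            change = 0.0
            for i in range(len(v)):
                # Residual and diagonal Jacobian entry of discretized nodal equation i.
                F = c[i] * (v[i] - v_old[i]) / h + f(i, v)
                dF = c[i] / h + dfdv_diag(i, v)
                step = -F / dF        # a single Newton-Raphson step for node i
                v[i] += step          # Gauss-Seidel: later nodes see the update at once
                change = max(change, abs(step))
            if change <= eps:
                break
        else:
            raise RuntimeError("relaxation did not converge; reduce the time step")
        sweeps_per_point.append(sweep)
    return v, sweeps_per_point

# Hypothetical circuit: source U drives node 0 through G1, nodes 0 and 1 are
# coupled through G12, and each node has a cubic nonlinear conductance to ground.
U, G1, G12, A = 1.0, 1.0, 0.5, 0.1

def f(i, v):
    """Net current leaving node i through the conductive elements."""
    if i == 0:
        return G1 * (v[0] - U) + G12 * (v[0] - v[1]) + A * v[0] ** 3
    return G12 * (v[1] - v[0]) + A * v[1] ** 3

def dfdv_diag(i, v):
    """Partial derivative of f(i, v) with respect to v[i]."""
    if i == 0:
        return G1 + G12 + 3 * A * v[0] ** 2
    return G12 + 3 * A * v[1] ** 2

v_final, sweeps = ita(f, dfdv_diag, c=[1.0, 1.0], v0=[0.0, 0.0], h=0.1, steps=600)
print("final voltages:", v_final)
print("sweeps at first and last time point:", sweeps[0], sweeps[-1])
```

In an event-driven implementation the inner loop would visit only the nodes on the current event list, so that latent nodes cost nothing; the sketch visits every node at every time point for brevity.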

D.  Circuit  Examples 

Iterated Timing Analysis is an integral part of the
SPLICE1.6 mixed-mode simulator [19], [51] and a number
of example runs performed by this program are included
below. SPLICE1.6 uses a Successive Over Relaxation
(SOR)-Newton method [29], which defaults to Gauss-Seidel-Newton
for solving the nonlinear difference equations
of (3). The program uses event-driven selective trace
analysis to exploit latency [52]. A fundamental limitation
of the present implementation of ITA, which reduces its
overall effectiveness, is its simple time-step control mechanism.
Another ITA program, SPLICE2 [53], is under development
which overcomes this and other limitations of
SPLICE1.

For large digital circuits, SPLICE1.6 is still 10-50 times
faster than SPICE2 for the same output waveforms. The
actual speedup factor depends on the nature of the digital
circuit (highly pipelined, random logic, etc.) and the type
of MOS technology in use (2-phase static, 4-phase dynamic,
etc.). In each case, the circuit latency and strength
of the bilateral coupling between circuit blocks determine
the actual speedup. Table I contains results for the analysis
of a large, random logic circuit. The corresponding output
waveforms are shown in Fig. 12(a). Fig. 12(b) shows a
comparison with SPICE2 for a small glitch in the waveform.
Note that the program is significantly faster than
SPICE2 for substantially the same waveform information.
The precise tradeoff between accuracy and speed can be
adjusted by varying the convergence criteria of both programs.

A major drawback of standard timing simulators is their
inability to handle floating elements accurately, in particular,
floating capacitors. As shown in the previous section,
this is most apparent when the value of the floating capacitor
is large with respect to the grounded capacitors at each
of its terminals. Strong feedback and high gain also present
a difficult problem for a relaxation-based ITA program.
Fig. 13(a) shows the schematic diagram of an MOS operational
amplifier used as part of a phase-locked loop circuit
[54]. The circuit was analyzed by both SPLICE1.6 and
SPICE2 in a unity-gain configuration. Such circuits have
always  proved  the  most  difficult  even  for  conventional 
circuit  simulators.  Note  the  large  capacitive  feedback  pro¬ 
vided  by  the  floating  compensation  capacitor.  All  transis¬ 
tors  included  parasitic  capacitors  Cf)  and  The  output 
waveforms for both SPLICE1.6 and SPICE2, for the same
step input, are shown in Fig. 13(b). The only differences in
the waveforms are at the beginning of the analysis and are
due to slightly different initial conditions assumed by each
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TABLE  I 

Comparisons of Conventional Circuit Simulation, Iterated
Timing Analysis, and Timing Simulation for Two Example
Circuits

Circuit:      Encoder/decoder               Operational Amplifier
MOSFETs:      1,326                         15
Nodes:        653                           14

              Time (s)    Memory (Kbyte)    Time (s)    Memory (Kbyte)
SPICE2G       115,640     2,420             59.8        29.0
SPLICE1.6     1,740       66.9              114.3       9.3
SPLICE1.3     789         64.4              -           -
Fig. 12. (a) Selected waveforms from the encode/decode circuit obtained
by SPLICE1.6. (b) Expanded view of a small pulse in the
encode/decode circuit waveforms.

program. In this case, however, the prototype SPLICE1.6
program ran two times longer than SPICE2, as shown in
Table I. While it is to be expected that such a worst-case
circuit  would  reduce  the  performance  of  the  program  due 
to  the  strong  capacitive  feedback  and  high  forward  gain  of 
the  circuit,  it  is  anticipated  that  this  time  difference  will  be 
reduced  as  the  SPLICE1  program  is  developed  further. 


Fig. 13. (a) Schematic diagram of operational amplifier circuit. (b)
Response of operational amplifier to pulse input.

E.  Conclusions 

As previously illustrated, Iterated Timing Analysis has a
great  deal  of  potential  for  improving  the  performance  of 
electrical  simulation.  Not  only  can  this  technique  outper¬ 
form  standard  circuit-simulation  programs  for  the  analysis 
of  large  circuits  on  standard  computers,  but  it  offers  the 
possibility  of  a  further  substantial  speedup  when  imple¬ 
mented  on  special-purpose  hardware. 

V.  Waveform  Relaxation  Techniques 
A.  Introduction 

Both timing algorithms and iterated timing analysis are
based on the application of relaxation techniques to the
solution of circuit equations at the nonlinear algebraic
equation level. As pointed out in Section II, there is a part
missing in Fig. 3: a relaxation method at the differential
equation level. While relaxation techniques in the linear
and nonlinear algebraic case deal with vectors in $R^n$,
relaxation techniques at the differential equation level must
deal with elements in function spaces, i.e., waveforms.
Recently, a family of relaxation techniques applied to the
differential equation level, called Waveform Relaxation
(WR), has been proposed in [20], [21]. WR methods have
been implemented in an experimental circuit simulator
called RELAX [21] that has proven to be effective for the
accurate analysis of some MOS digital circuits with more
than an order of magnitude speed improvement over
standard circuit simulators.

In this Section, the basic ideas of WR methods are
reviewed and some WR applications and extensions are
presented. To begin, a simple example is used to illustrate
the method, and then the general "Gauss-Seidel" algorithm
in the WR family for MOS digital circuits is described.
A more detailed and complete description of these
techniques is available in [20], [21].

B. The Waveform-Relaxation Gauss-Seidel Algorithm

Consider the first-order two-dimensional differential
equation in $x(t) \in R^2$ on $t \in [0, T]$:

$$\dot{x}_1 = f_1(x_1, x_2, t), \qquad x_1(0) = x_{10} \qquad (44a)$$

$$\dot{x}_2 = f_2(x_1, x_2, t), \qquad x_2(0) = x_{20} \qquad (44b)$$

The basic idea of the "Gauss-Seidel" waveform-relaxation
algorithm is to fix the waveform $x_2: [0, T] \to R$ and solve
(44a) as a one-dimensional differential equation in $x_1(\cdot)$.
The solution thus obtained for $x_1$ can be substituted into
(44b) which will then reduce to another first-order differential
equation in one variable, $x_2$. Equation (44a) is then
re-solved using the new solution for $x_2(t)$, and the procedure
is repeated.

In  this  fashion,  an  iterative  algorithm  has  been  con¬ 
structed.  It  replaces  the  problem  of  solving  a  differential 
equation  in  two  variables  by  one  of  solving  a  sequence  of 
differential  equations  in  one  variable.  As  described  earlier, 
the  waveform  relaxation  algorithm  can  be  seen  as  an 
analogue  of  the  Gauss-Seidel  technique  for  solving  nonlin¬ 
ear  algebraic  equations.  Here,  however,  the  unknowns  are 
waveforms  (elements  of  a  function  space),  rather  than  real 
variables.  In  this  sense,  the  algorithm  is  a  technique  for 
time-domain  decoupling  of  differential  equations. 
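
A minimal Python sketch of this idea is given below for a hypothetical linear instance of (44), $\dot{x}_1 = -2x_1 + x_2$, $\dot{x}_2 = x_1 - 2x_2$, with $x_1(0) = 1$ and $x_2(0) = 0$. The choice of system, the fixed Backward Euler grid used to solve each scalar equation, and the function names are assumptions made for illustration; they are not taken from RELAX. The relaxed waveforms are compared with those obtained by applying Backward Euler to the coupled system on the same grid.

```python
import math

def wr_gauss_seidel(T=2.0, steps=200, eps=1e-10, max_iter=100):
    """Gauss-Seidel waveform relaxation: each scalar equation is integrated
    over the whole interval [0, T] while the other waveform is held fixed."""
    h = T / steps
    x1 = [1.0] * (steps + 1)      # initial guess: constant waveforms equal
    x2 = [0.0] * (steps + 1)      # to the initial conditions
    for k in range(1, max_iter + 1):
        new1 = [1.0]
        for m in range(steps):    # x1' = -2 x1 + x2, with x2 from the previous iteration
            new1.append((new1[m] + h * x2[m + 1]) / (1 + 2 * h))
        new2 = [0.0]
        for m in range(steps):    # x2' = x1 - 2 x2, with the x1 just computed
            new2.append((new2[m] + h * new1[m + 1]) / (1 + 2 * h))
        change = max(max(abs(a - b) for a, b in zip(new1, x1)),
                     max(abs(a - b) for a, b in zip(new2, x2)))
        x1, x2 = new1, new2
        if change <= eps:
            return x1, x2, k
    raise RuntimeError("waveform relaxation did not converge")

def coupled_backward_euler(T=2.0, steps=200):
    """Backward Euler applied to the coupled system, for comparison."""
    h = T / steps
    det = (1 + 2 * h) ** 2 - h ** 2
    x1, x2 = [1.0], [0.0]
    for m in range(steps):
        a, b = x1[m], x2[m]
        x1.append(((1 + 2 * h) * a + h * b) / det)
        x2.append((h * a + (1 + 2 * h) * b) / det)
    return x1, x2

w1, w2, iterations = wr_gauss_seidel()
r1, r2 = coupled_backward_euler()
exact_x1_T = 0.5 * (math.exp(-2.0) + math.exp(-6.0))   # analytic x1(T) at T = 2
print("WR iterations:", iterations)
print("x1(T): relaxed", w1[-1], "coupled", r1[-1], "analytic", exact_x1_T)
```

Each pass of the outer loop handles only scalar equations, and the converged waveforms agree with the coupled solution on the grid; the remaining difference from the analytic value is the discretization error of Backward Euler.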

WR  algorithms  applied  to  circuits  can  be  formulated  in 
a  number  of  ways.  A  “Gauss-Seidel"  WR  algorithm  for 
MOS  circuits  will  be  considered  in  the  following  analysis. 
Recall  that,  according  to  (2),  the  circuit  equations  are 
formulated  as 

$$C(v,u)\dot{v} + f(v,u) = 0, \qquad v(0) = V \qquad (45)$$

where $C: R^n \times R^r \to R^{n \times n}$ is a symmetric diagonally dominant
matrix-valued function in which $-C_{ij}(v,u)$, $i \ne j$, is
the total floating capacitance between nodes $i$ and $j$,
$C_{ii}(v,u)$ is the sum of the capacitances of all capacitors
connected to node $i$, and $f: R^n \times R^r \to R^n$ is a continuous
function, each component of which represents the
net  current  charging  the  capacitor  at  each  node  due  to 
the  pass  transistors,  the  other  conductive  elements,  and  the 
controlled  current  sources. 

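Algorithm V-B-1 (WR Gauss-Seidel Algorithm for Solving (45))

Comment: The superscript $k$ denotes the iteration count, the
subscript $i$ denotes the component index of a vector, and $\epsilon$ is a
small positive number.

k ← 0;
guess waveform $v^0(t)$; $t \in [0, T]$ such that $v^0(0) = V$
    (for example, set $v^0(t) = V$, $t \in [0, T]$);
repeat {
    k ← k + 1;
    foreach (i in N) {
        solve

        $$\sum_{j=1}^{i} C_{ij}(v_1^k, \ldots, v_i^k, v_{i+1}^{k-1}, \ldots, v_n^{k-1}, u)\,\dot{v}_j^k + \sum_{j=i+1}^{n} C_{ij}(v_1^k, \ldots, v_i^k, v_{i+1}^{k-1}, \ldots, v_n^{k-1}, u)\,\dot{v}_j^{k-1} + f_i(v_1^k, \ldots, v_i^k, v_{i+1}^{k-1}, \ldots, v_n^{k-1}, u) = 0$$

        for ($v_i^k(t)$; $t \in [0, T]$), with the initial condition
        $v_i^k(0) = V_i$;
    }
}
until ($\max_{1 \le i \le n} \max_{t \in [0, T]} |v_i^k(t) - v_i^{k-1}(t)| \le \epsilon$)
that is, until the iteration converges. ■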

Note that (45) has only one unknown variable $v_i^k$. The
variables $v_{i+1}^{k-1}, \ldots, v_n^{k-1}$ are known from the previous iteration
and the variables $v_1^k, \ldots, v_{i-1}^k$ have already been computed.
Note also that the Gauss-Jacobi version of the WR
algorithm  presented  earlier  can  be  obtained  simply  by 
replacing  the  foreach  statement  with  the  forall  statement 
and  adjusting  the  iteration  indices  in  the  same  way  as  can 
be  done  in  the  Gauss-Jacobi  version  of  ITA.  In  the 
Gauss-Jacobi  case,  the  computation  can  be  performed  in 
parallel,  and  hence  it  is  more  suitable  for  implementation 
on  special-purpose  hardware. 

As an example of how Algorithm V-B-1 can be applied,
consider the MOS circuit shown in Fig. 14. For the sake of
simplicity it is assumed that all capacitors are linear. Hence
the dynamical behavior of the circuit can be described as
follows:

$$(c_1 + c_2 + c_3)\dot{v}_1 - i_1(v_1) + i_2(v_1, u_1) + i_3(v_1, u_2, v_2) - c_1\dot{u}_1 - c_3\dot{u}_2 = 0$$
$$(c_4 + c_5 + c_6)\dot{v}_2 - c_6\dot{v}_3 - i_3(v_1, u_2, v_2) - c_4\dot{u}_2 = 0$$
$$(c_6 + c_7)\dot{v}_3 - c_6\dot{v}_2 - i_4(v_3) + i_5(v_3, v_2) = 0. \qquad (46)$$

Applying the WR procedure to (46) the $k$th iteration
corresponds to solving the following equations:

$$(c_1 + c_2 + c_3)\dot{v}_1^k - i_1(v_1^k) + i_2(v_1^k, u_1) + i_3(v_1^k, u_2, v_2^{k-1}) - c_1\dot{u}_1 - c_3\dot{u}_2 = 0$$
$$(c_4 + c_5 + c_6)\dot{v}_2^k - c_6\dot{v}_3^{k-1} - i_3(v_1^k, u_2, v_2^k) - c_4\dot{u}_2 = 0$$
$$(c_6 + c_7)\dot{v}_3^k - c_6\dot{v}_2^k - i_4(v_3^k) + i_5(v_3^k, v_2^k) = 0. \qquad (47)$$

The circuit interpretation of (47) is shown in Fig. 15.

Fig. 14. Circuit example for Gauss-Seidel waveform relaxation algorithm.

Fig. 15. Circuit interpretation of Gauss-Seidel waveform relaxation
applied to the circuit of Fig. 14.

If the original circuit in Fig. 14 consists of three subcircuits
$s_1$, $s_2$, and $s_3$, then the decomposed circuit consists of the
subcircuits $s_1$, $s_2$, and $s_3$ together with additional components
that approximate the loading effects due to the rest of the
circuit. Hence, the
WR procedure for analyzing the circuit in Fig. 14 can be
described in circuit terms as follows:

k ← 0;
make an initial guess of $v_2^0(t)$, $v_3^0(t)$; $t \in [0, T]$;
repeat {
    k ← k + 1;
    analyze $s_1$ for its output waveform $v_1^k(\cdot)$ by
        approximating the loading effect due to $s_2$;
    analyze $s_2$ for its output waveform $v_2^k(\cdot)$ by
        using $v_1^k(\cdot)$ as its input and approximating
        the loading effect due to $s_3$;
    analyze $s_3$ for its output waveform $v_3^k(\cdot)$ by
        using $v_2^k(\cdot)$ as its input;
}
until (the difference between $\{(v_1^k(t), v_2^k(t), v_3^k(t));\ t \in [0, T]\}$
    and $\{(v_1^{k-1}(t), v_2^{k-1}(t), v_3^{k-1}(t));\ t \in [0, T]\}$ is sufficiently
    small) ■

From the previous procedure and example it can be seen
that each component of the decomposition is a dynamical
subcircuit which is processed for the entire time interval
$[0, T]$ in a fixed order. When each subcircuit is being
processed, the interactions (coupling or loads) from the rest
of the circuit are approximated by using the information
obtained from the most recent iteration. The iteration is
carried out until satisfactory convergence of all waveforms
is detected. It can be shown that for the MOS circuit in
Fig. 14 the sequence of waveforms generated by the WR
procedure will always converge to the correct waveform
independent of the initial guess provided that $c_2$, $c_5$, and $c_7$
are not zero. In [20], [50], a strong convergence result was
proved that can be applied to Algorithm V-B-1 as follows.

Theorem V-B-1: Assume that:

1)  the  charge-voltage  characteristic  of  each  capacitor,  the 
current-voltage  characteristic  of  each  conductor,  and  the 
drain-current  characteristic  of  each  MOS  device  are 
Lipschitz  continuous  with  respect  to  their  controlling  vari¬ 
ables; 

2) $C_{\min} > 0$ and $C_{\max} < \infty$ where $C_{\min} \in R$ is the minimum
value of all grounded capacitors at any permissible
value of node voltage, and $C_{\max} \in R$ is the maximum value
of all floating capacitors at any permissible value of node
voltages; and

3)  the  current  through  any  controlled  conductor  (e.g.,  the 
drain  current  of  an  MOS  device)  is  uniformly  bounded 
throughout  the  relaxation  process. 

Then,  for  any  given  set  of  initial  conditions  and  any  given 
piecewise  continuous  input  u(-).  Algorithm  V-B-l  gener¬ 
ates  a  converging  sequence  of  iterated  solutions  whose  limit 
satisfies  the  circuit  equations  and  the  given  initial  condi¬ 
tions.  ■ 
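
The relaxation procedure and its convergence can be illustrated with a
small sketch. The example is not taken from RELAX: it assumes a
hypothetical two-node linear RC ladder (conductances g1, g2, grounded
capacitors c1, c2, constant input u), a fixed Backward Euler time step
shared by both subcircuits, and a zero initial guess. Each pass analyzes
node 1 over the whole interval using the previous waveform of node 2,
then node 2 using the new waveform of node 1. A coupled solution is
included only as a reference for comparison.

```python
def wr_gauss_seidel(u=1.0, T=5.0, steps=200, g1=1.0, g2=0.5,
                    c1=1.0, c2=1.0, tol=1e-9, max_iter=200):
    """Gauss-Seidel waveform relaxation on a two-node RC ladder (illustrative)."""
    h = T / steps
    v1 = [0.0] * (steps + 1)  # initial guess: waveforms held at the initial condition
    v2 = [0.0] * (steps + 1)
    for k in range(1, max_iter + 1):
        # Subcircuit 1 over the whole interval, using v2 from iteration k-1.
        new_v1 = [0.0] * (steps + 1)
        for n in range(1, steps + 1):
            rhs = new_v1[n - 1] + h * (g1 * u + g2 * v2[n]) / c1
            new_v1[n] = rhs / (1.0 + h * (g1 + g2) / c1)
        # Subcircuit 2 over the whole interval, using v1 from iteration k.
        new_v2 = [0.0] * (steps + 1)
        for n in range(1, steps + 1):
            rhs = new_v2[n - 1] + h * g2 * new_v1[n] / c2
            new_v2[n] = rhs / (1.0 + h * g2 / c2)
        # Largest change of any waveform sample between iterations k-1 and k.
        change = max(max(abs(a - b) for a, b in zip(new_v1, v1)),
                     max(abs(a - b) for a, b in zip(new_v2, v2)))
        v1, v2 = new_v1, new_v2
        if change < tol:
            return v1, v2, k
    raise RuntimeError("waveform relaxation did not converge")


def coupled_backward_euler(u=1.0, T=5.0, steps=200, g1=1.0, g2=0.5,
                           c1=1.0, c2=1.0):
    """Reference: both node equations solved together at every time point."""
    h = T / steps
    a11 = 1.0 + h * (g1 + g2) / c1
    a12 = -h * g2 / c1
    a21 = -h * g2 / c2
    a22 = 1.0 + h * g2 / c2
    det = a11 * a22 - a12 * a21
    v1, v2 = [0.0], [0.0]
    for _ in range(steps):
        b1 = v1[-1] + h * g1 * u / c1
        b2 = v2[-1]
        v1.append((b1 * a22 - a12 * b2) / det)  # Cramer's rule, 2x2 system
        v2.append((a11 * b2 - a21 * b1) / det)
    return v1, v2
```

Because both capacitors are grounded and nonzero, the relaxed waveforms
in this sketch approach the coupled solution; the number of passes
needed depends on the coupling conductance and on the interval length.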

C.  Waveform  Relaxation  in  RELAX 

The WR "Gauss-Seidel" algorithm was implemented in
an experimental circuit simulator, RELAX. Actually,
RELAX implemented a modified version of the WR
algorithm described previously. These modifications were
introduced to improve the speed of convergence of the
algorithm and to exploit the structure of the class of circuits
to be analyzed, i.e., MOS digital circuits. These modifications
are as follows.

1)  Rather  than  having  strictly  one  unknown  per  each 
component  of  the  decomposition  as  stated  in  Algorithm 
V-B-l,  RELAX  allows  each  decomposed  subcircuit  to  have 
more  than  one  unknown.  This  corresponds  to  a  block 
relaxation  method  that  can  be  proven  to  have  similar 
convergence  properties.  In  fact,  the  analysis  of  the  circuit  is 
decomposed  into  the  analysis  of  subcircuits  each  of  which 
corresponds  to  a  physical  subcircuit  that  is  built  into  the 
program  and  called  by  the  user  as  a  unit,  such  as  a  NOR  or 
a  NAND. 

2) Each decomposed subcircuit is processed by using
standard circuit-analysis techniques. The Backward Euler
integration method with variable time steps is used to
discretize the differential equations associated with the
subcircuit, and the Newton-Raphson method is used to
solve the nonlinear algebraic equations resulting from the
discretization. Since the number of unknowns associated
with a subcircuit is usually small, the linear equation solver
used by the Newton-Raphson method is implemented by
using standard full-matrix techniques rather than using
sparse-matrix techniques. Note that in RELAX each
subcircuit is analyzed independently from t = 0 to t = T, using
its own time-step sequence, controlled by the integration
method, whereas in a standard circuit simulator the entire
circuit is analyzed from t = 0 to t = T using only one
common time-step sequence. In RELAX, the time-step
sequence of one subcircuit is usually different from the
others, but contains, in general, a smaller number of time
steps than that used in a standard circuit simulator for
analyzing the same circuit.

3) The order according to which each subcircuit is
processed is determined in RELAX prior to starting the
iteration by a subroutine called "scheduler." Although,
according to Theorem V-B-1, scheduling is not necessary to
guarantee convergence of the iteration, it does have an
impact on the speed of convergence, as is the case for the
relaxation methods at the linear- and nonlinear-equation
levels. Assume now that the circuit consists of unidirectional
subcircuits with no feedback path. Exactly as for the
other relaxation methods introduced before, if the subcircuits
are processed according to the flow of signals in the
circuit, the algorithm used in RELAX will converge in just
two iterations (actually the second iteration is needed only
to verify that convergence has been obtained). For MOS
digital circuits which contain almost unidirectional
subcircuits, it is intuitive that convergence of the WR procedure
will be achieved more rapidly if the subcircuits are
processed according to the flow of signals in the circuit.
The scheduler traces the flow of signals through the circuit
and generates a static order for the processing of
subcircuits. To be able to trace the flow of signals, the scheduler
requires the user to specify the flow of signals through
each subcircuit by partitioning the terminals of the
subcircuit into input and output terminals. This is needed since
in RELAX the basic unit is a subcircuit. In general, a
designer can easily specify what the flow of the signals is
intended to be even in a subcircuit which is not
unidirectional, such as a transmission gate or a subcircuit
containing floating capacitors between its input and output
terminals. The analysis algorithm in RELAX will indeed
take into account the bidirectional effects correctly.

4) The first iteration in RELAX is carried out by assuming
that there is no loading effect due to fanouts. The
"standard" WR procedure actually begins at the second
iteration in RELAX. Hence, strictly speaking, the first
iteration in RELAX is used to generate a good initial guess
for the actual WR procedure.
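
The static ordering produced by the scheduler in modification 3) can be
sketched as a topological sort of the subcircuit fanout graph. This is
only one plausible way to trace signal flow, not the RELAX subroutine
itself: the function name, the dictionary representation of fanout, and
the rule of appending subcircuits inside feedback loops in name order
are all assumptions made for illustration.

```python
from collections import deque


def schedule(fanout):
    """Order subcircuits along the signal flow.

    fanout maps each subcircuit name to the list of subcircuits driven by
    its output terminals (the user-supplied input/output partition).
    """
    indegree = {name: 0 for name in fanout}
    for name, driven in fanout.items():
        for d in driven:
            indegree[d] = indegree.get(d, 0) + 1

    # Subcircuits driven only by primary inputs can be processed first.
    ready = deque(sorted(n for n, deg in indegree.items() if deg == 0))
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for d in fanout.get(name, []):
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)

    # Subcircuits in feedback loops never reach indegree 0; append them in
    # name order (an arbitrary choice: convergence does not depend on it).
    placed = set(order)
    leftover = sorted(n for n in indegree if n not in placed)
    return order + leftover
```

For a circuit with no feedback path, processing the subcircuits in the
returned order corresponds to the case in which the iteration settles
after the first pass and the second pass only confirms convergence.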

D.  Speed-up  Techniques 

In addition to the previous modifications, RELAX
incorporates two bypass techniques to speed up the process
of analyzing a subcircuit. The key idea is once more to
bypass the analysis of a subcircuit for certain time intervals
without losing accuracy by exploiting the information
obtained from previous time points and/or from previous
iterations.

The two techniques used in RELAX are presented here
by showing their application for the analysis of the
subcircuit $s_1$ of the circuit shown in Fig. 16, which is a schematic
diagram of the circuit in Fig. 14. The output voltages of $s_1$
and $s_2$ at the kth iteration are denoted by $v_1^k$ and $v_2^k$,
respectively.

The first technique is based on the latency of $s_1$ and is
similar to the technique described in Section III. According
to (47), $s_1$ is analyzed in the first iteration with no loading
effect from $s_2$. After it has been analyzed for a few time
points, its output voltage $v_1^1$ is found to be almost constant
with time, i.e., $\dot{v}_1^1(0.01) \approx 0$ (see Fig. 16(b)). Since the input
of $s_1$, i.e., $u$, is also constant during the interval [0.01, 1.9],
the subcircuit $s_1$ is said to be "latent" in the first iteration
during the interval [0.01, 1.9], and its analysis during this
interval is bypassed. From Fig. 16(b), $s_1$ is latent again in
the interval [2.15, 3]. Note that, according to (47), the
check for latency of $s_1$ after the first iteration will include
$v_2$ and $\dot{v}_2$ as well as $u$, since they can affect the value of $v_1$.
For most digital circuits, the latency intervals of a
subcircuit usually cover a large portion of the entire simulation
time interval [0, T], and hence the implementation of this
technique can provide a considerable saving of computing
time, as shown in Table II.

The second technique is based on the partial convergence
of a waveform during the previous two iterations. This
technique is introduced by using the example of Fig. 16.
After the first two iterations, we observe that the values of
$v_1^1$ and $v_1^2$ during the interval [1.7, 3.0] do not differ
significantly (see Fig. 16(b), (c)), i.e., the sequence of
waveforms of $v_1$ seems to converge in this interval after two
iterations. In the third iteration, shown in Fig. 16(d), $s_1$ is
analyzed from t = 0 to t = 1.8 and $v_1^3(1.8)$ is found to be
almost the same as $v_1^2(1.8)$. Moreover, during the interval
[1.8, 3], the value of $v_2^2$, which affects the value of $v_1^3$, also
does not differ significantly from the values of $v_2^1$ (which
affects the value of $v_1^2$). Hence the value of $v_1^3$ during the
interval [1.8, 3] should remain the same as $v_1^2$, and the
analysis of $s_1$ during this interval in the third iteration will
be bypassed. This technique can provide a considerable
saving of computing time, as shown in Table II, since the
intervals of convergence can cover a large portion of the
entire simulation time interval [0, T], especially in the last
few iterations. Note that the subcircuit need not be latent
during the intervals of convergence, although overlapping
of these intervals with the latency intervals is possible.

Fig. 16. (a) Schematic diagram of the circuit of Fig. 14. (b), (c), and (d)
The waveforms of the two node voltages at the first, second, and third
waveform relaxation iteration, respectively. (e) The difference of the
waveforms between the first and second iteration.
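
The partial-convergence test can be sketched as follows. The sketch
makes simplifying assumptions that are not stated in the text: all
waveforms are sampled on one common time grid (in RELAX each subcircuit
has its own time-step sequence), agreement is judged by an absolute
tolerance, and only a converged interval reaching to the end of the
simulation is detected. The function names are hypothetical.

```python
def converged_tail_start(prev, curr, tol):
    """Smallest index i such that prev and curr agree within tol from i onward.

    Returns len(curr) when even the last samples differ.
    """
    i = len(curr)
    while i > 0 and abs(prev[i - 1] - curr[i - 1]) <= tol:
        i -= 1
    return i


def bypass_start(waveform_pairs, tol=1e-3):
    """First index from which a subcircuit's analysis may be bypassed.

    waveform_pairs holds (previous, current) sample lists for the
    subcircuit's own output and for every waveform that affects it.
    The bypass may begin only where all of them have stopped changing.
    """
    return max(converged_tail_start(p, c, tol) for p, c in waveform_pairs)
```

In the example of Fig. 16 the pairs would be the first and second
iterates of the output of the subcircuit being analyzed and of the
neighbouring node voltage that loads it; the returned index plays the
role of the time t = 1.8 at which the third-iteration analysis stops.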

E. Some Extensions of Waveform-Relaxation Techniques:
RELAX2

RELAX is written in FORTRAN77. It can handle MOS
digital circuits containing NOR gates, NAND gates,
transmission gates, multiplexers (or banks of transmission gates
whose outputs are connected together), super buffers, and
cross-coupled NOR gates (or RS flip-flops). It uses the
Shichman-Hodges model [33] for the MOS device. All the
computations were performed in double precision and
the results were also stored in double precision. Although
the RELAX code is rather small, approximately 4000
FORTRAN lines, it requires a large amount of storage for
the waveforms, especially when large circuits are analyzed.
For an MOS circuit containing 1000 nodes with 100
analysis time points per node, the waveform storage requires
approximately 3 x 1000 x 100 floating point numbers
(corresponding to 2.4 Mbytes if each number is stored in 64
bits).
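
The storage estimate can be checked with a few lines of arithmetic. The
sketch below uses only the figures quoted in the text (1000 nodes, 100
time points per node, a factor of 3 stored values per time point, 64-bit
numbers); the function and parameter names are illustrative, and the
text does not say what the three stored values are.

```python
def waveform_storage_bytes(nodes, points_per_node,
                           values_per_point=3, bits_per_value=64):
    """Bytes needed to hold every node waveform for one simulation."""
    numbers = values_per_point * nodes * points_per_node
    return numbers * bits_per_value // 8


# 3 * 1000 * 100 = 300 000 numbers of 8 bytes each.
storage = waveform_storage_bytes(nodes=1000, points_per_node=100)
storage_mbytes = storage / 1e6
```

The requirement grows linearly with both the node count and the number
of time points kept per node.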


TABLE II

Comparison of CPU Times Used by RELAX for Analyzing
the Circuit of Fig 0 with and without the Latency and
the Partial Waveform Convergence Techniques
(Case 1: Without the latency and the partial waveform
convergence techniques.
Case 2: With only the latency technique.
Case 3: With only the partial waveform convergence technique.
Case 4: With both the latency and the partial waveform
convergence techniques.)

                       CPU time (seconds)
Iteration #    Case 1    Case 2    Case 3    Case 4
     1         0.363     0JSS      OXfO      0.333
     2         0.664     0.704     0J14      0.663
     3         0JI1      0.703     0.643     0.336
     4         0J3S      0.704     0.161     0.104
     5         0.631     0.603     0.067     0X16
   Total       3J66      3.063     1.164     1.311

A new version of RELAX, RELAX2 [55], written in C,
has extended waveform-relaxation algorithms and made
them more practical. The first important extension consists of
allowing arbitrary subcircuits to be defined by the user, as
was provided in the original SPLICE1 program. These
subcircuits are analyzed by a subroutine patterned after
SPICE, while the original RELAX used "hard-wired"
subcircuit analyzers made possible by the fixed structure of the
subcircuits allowed by the program. Thus RELAX2 is
slower than RELAX, but still maintains a definite
advantage over standard circuit simulators, approximately
one order of magnitude speed improvement.

Fig. 17. A two cross-coupled NAND-gate circuit.

The second extension consists of allowing the
decomposition of the time interval of interest for the time-domain
simulation into subintervals, called windows [56].

Digital circuits can be broken up into two very broad
classes: circuits with logic feedback loops (finite-state
machines, asynchronous circuits, digital oscillators) and
circuits without logic feedback loops (most combinational
logic, programmable logic arrays). Experience simulating
MOS digital circuits using RELAX2 shows that most MOS
digital circuits without logic feedback loops converge in
fewer than ten iterations. However, circuits with logic
feedback loops may take many more iterations to converge,
and the number of iterations required is proportional to
the length of the simulation interval.

This suggests that the interval of simulation should be
broken into "windows", $[0, T_1], [T_1, T_2], \ldots, [T_{n-1}, T_n]$, so
that the relaxation will converge rapidly in each window.
Waveform relaxation is applied to the first window, $[0, T_1]$,
and the values of the node voltages at $T_1$ are then used as
initial conditions for the analysis of the second window.
This procedure is repeated until all windows have been
analyzed.
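
The windowing procedure reduces to a simple driver loop, sketched here
under stated assumptions. The callable solve_window is hypothetical: it
stands for a complete waveform-relaxation run over one window and
returns the node voltages at the end of that window. In the demo it is
replaced by the exact solution of dx/dt = -x so that the hand-off of
initial conditions can be checked.

```python
import math


def simulate_in_windows(solve_window, x0, boundaries):
    """Apply a window solver to [T0, T1], [T1, T2], ... in sequence.

    boundaries is the increasing list [0, T1, ..., Tn]. The state at the
    end of each window becomes the initial condition of the next one.
    Returns the list of end-of-window states.
    """
    state = x0
    end_states = []
    for t_start, t_end in zip(boundaries, boundaries[1:]):
        state = solve_window(t_start, t_end, state)
        end_states.append(state)
    return end_states


def decay_window(t_start, t_end, x_start):
    """Stand-in window solver: exact solution of dx/dt = -x."""
    return x_start * math.exp(-(t_end - t_start))
```

A call such as simulate_in_windows(decay_window, 1.0, [0.0, 1.0, 2.0, 3.0])
would analyze three windows in order; only the waveforms of the window
currently being relaxed need to be held in memory.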

Consider the analysis of the cross-coupled NAND gate
circuit shown in Fig. 17 with the "window" approach
provided by RELAX2; convergence is quite rapid (see
Table III). There is a trade-off, however. As the window
size gets smaller, some of the advantages of waveform
relaxation are lost. One cannot take advantage of a digital
circuit's natural latency over the entire waveform, but only
within that window. The scheduling overhead increases
when the windows become smaller, as each subcircuit must
be scheduled once for each window, and if the windows are
made very small, time steps chosen to calculate the
waveforms will be limited by the window size rather than by the
local truncation error, and unnecessary calculations will be
performed.

TABLE III

Windowing Experiments for the Cross-Coupled NAND
Gates Performed on a VAX 11/780 Running Berkeley
UNIX 4.1c

Breaking the simulation interval into windows also has
the advantage of reducing the memory requirements of
WR, since these are proportional to the number of time
points used by the numerical integration algorithm used to
solve the subcircuit equations.

Other extensions to WR included in RELAX2 involve
the approximate solution of the subcircuit equations [56],
[57]. When using iterative decomposition methods for
solving systems of nonlinear equations, it may be possible to
reduce the calculations required by not solving the
decomposed nonlinear equations exactly at each iteration. In
some cases the convergence of the algorithm is not affected
by the inaccurate solutions. An example is the
Gauss-Seidel-Newton method [27] described in Section II. In the
WR case, simpler approximate methods for calculating the
node waveforms are used for the first few iterations; more
complex and more exact methods are used for the last few
iterations.

One way of simplifying the calculation of the node-voltage
waveforms is to use a simple model for the MOS
devices and then to switch to the more detailed model as
the waveforms approach convergence. The simple device
model used in RELAX2 is a resistor in series with a switch,
where the size of the resistance is scaled with the device
size. Using such a model in the calculation of waveforms is
not straightforward, because the equations describing the
model cannot be solved easily using the Newton-Raphson
method. The Newton-Raphson method will often oscillate
about the point where the simple model's switch changes
state. One solution to this problem is not to carry out the
Newton-Raphson method to convergence, but to do only
one iteration. The result is that the calculation of the
waveforms using the simple model is quite fast, but only
approximate, even if the simple model is assumed to be
correct. The results obtained using the simple model and
then changing to the more detailed model have been
disappointing so far, as demonstrated in the examples of Table
IV. In circuits without logic feedback, the simple model did
not provide a better guess for the waveforms than one
iteration using more complex models. It is possible that the
addition of another term to the simple model, to make it
smoother, may help. Then the Newton-Raphson
algorithm could be used to achieve the accuracy required to
produce a useful first guess for the iterations using the
more detailed models.

Another approach to simplifying the calculations
performed in the first few iterations of the WR algorithm is to
allow the numerical integration algorithm, which is used to
solve for the node waveforms of the decomposed
subcircuits, to use a larger local truncation error. Here, unlike
changing the device models, it is possible to increase the
accuracy of the calculation of the node waveforms at each
iteration by tightening the local truncation error limit. In
the case of RELAX, since most circuits converge in about
5 iterations, a local truncation error was chosen that is
about 3 times larger than the local truncation error that
would be chosen to calculate the waveforms for the final
answer. Then after each iteration the local truncation error
is multiplied by 0.7. The results from this approach are
shown in the last row of Table IV.
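
The tightening schedule can be worked out numerically. The sketch uses
the two figures given in the text (a starting limit about 3 times the
final one, multiplied by 0.7 after each iteration); the function name is
illustrative, and the text does not say whether the limit is clamped at
the final value, so no clamping is applied here.

```python
def lte_schedule(final_lte, iterations, start_factor=3.0, shrink=0.7):
    """Local truncation error limit used in each WR iteration."""
    limit = start_factor * final_lte
    schedule = []
    for _ in range(iterations):
        schedule.append(limit)
        limit *= shrink  # tighten the limit for the next iteration
    return schedule


# Limits relative to the final-answer limit for five iterations:
# 3.0, 2.1, 1.47, 1.029, 0.7203 (up to rounding).
relative_limits = lte_schedule(final_lte=1.0, iterations=5)
```

With these figures the limit reaches roughly the final-answer value at
the fourth iteration, which matches the observation that most circuits
converge in about five iterations.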

A key question to be asked is whether the convergence of
WR algorithms is affected by these approximation
techniques. Note that Theorem V-B-1 assumes that the
solutions of the differential equations are computed exactly.
However, the convergence of the WR algorithm to the
exact solution of the original differential equations has
been proven provided the accuracy of the integration is
increased while the WR iterations are converging [56]. The
framework necessary to prove this result is the one of
nonstationary WR algorithms.

Algorithm V-B-1 is a stationary algorithm in the sense
that the iteration process is performed with the same set of
equations. Nonstationary WR algorithms are characterized
by the fact that the equations describing the system at each
iteration can change from one iteration to the other. Note
that the approximations computed by the integration
methods can be viewed as the exact solutions of perturbed
differential equations. Thus the iteration equations of the
WR algorithms can be seen as changing from iteration to
iteration. The convergence theorem proven in [50], [56]
assumes that the accuracy of the integration methods is
controllable and that, in the limit, the exact solution of a
differential equation is achieved. Fortunately, this property
is obviously satisfied when using simpler models in the first
iterations and switching to more complex and accurate models
at the last iterations. In addition, it is known that
consistent integration methods can be made as accurate as
desired by controlling parameters such as the local
truncation error. Thus the speed-up techniques presented here are
theoretically sound.

TABLE IV

Experiments with Variable Accuracy Models and Variable
Local Truncation Error (LTE)

                                  TEST CIRCUITS
                      Shift Cell       Two Phase Clk     Memory Cell
Method               # Iter  Time     # Iter  Time      # Iter  Time
SPICE                  -     12.52      -     ♦ 3 13      -     13.63
RELAX2                 4     2.10       4     5.47        4     2.96
Simple model           <     3.20       4     6.75        4     4.45
Simple model only      «     0.49       <     0.59        4     0.6 e
LTE                    5     1.24       3     38!         4     , 271

F.  Conclusions 

Waveform-relaxation methods have been proven to be
effective decomposition methods for the analysis of
large-scale MOS circuits. In particular, the methods have
guaranteed convergence properties. Since WR algorithms
are quite new, more research is needed to characterize
completely the trade-offs involved in the choice of a
particular method in the class (e.g., Gauss-Jacobi versus
Gauss-Seidel). In addition, an accurate comparison
between iterated timing analysis and waveform relaxation
must be carried out.

It  is  clear  that  WR  methods  are  quite  suitable  for 
implementation  on  a  parallel  or  pipeline  architecture  since 
they  allow  different  subcircuits  to  be  analyzed  concurrently 
on  different  processors. 

In addition, WR algorithms have recently been extended
to piecewise linear circuit analysis [57] and to other fields
such as electrophoresis process simulation [58].

VI.  Summary  and  Directions  for  Future  Work 
A.  Introduction 

Relaxation-based simulation techniques have been used
for the analysis of electronic circuits in many ways. Timing
simulators were the first relaxation-based electrical
simulators to gain widespread use and are still being used
successfully in many companies today. Unfortunately, since only a
single sweep of a relaxation method is used to approximate
the solution of the set of nonlinear algebraic equations
obtained at a time point, these simulators suffer from
severe accuracy problems when used to analyze circuits
containing tight feedback loops or floating circuit elements.
In Section III we presented an analysis of the numerical
techniques used in timing simulation, and described a
number of algorithms which can be used to improve the
accuracy of timing simulators. In particular, methods suited
to the analysis of floating capacitors have been described.

While timing simulators will continue to be used for the
analysis of circuits where constrained circuit design
methods, such as cell-based approaches, limit the likelihood of
simulation errors, they cannot be used for the analysis of
complex digital and analog circuits where feedback effects
are significant. A new relaxation-based approach, called
iterated timing analysis (ITA), can be used for the analysis of
these circuits. We have described the basic algorithms used
for ITA and their associated numerical properties in
Section IV. This approach provides accurate simulation results
while still achieving a substantial speed improvement over
conventional circuit-analysis techniques. As shown in
Section IV, ITA has also been used successfully for the
analysis of tightly coupled analog circuits containing a number
of floating capacitors.

Both of the aforementioned techniques apply relaxation
methods to the solution of a set of nonlinear algebraic
equations. In contrast to these techniques, the
waveform-relaxation method uses a relaxation approach at the
differential equation level. In Section V we have shown that
waveform relaxation has guaranteed convergence
properties for a wide class of electrical circuits, and has
performed over one order of magnitude faster than standard
circuit simulators on a number of test circuits while
maintaining the same, or even better, accuracy. A number of
improvements and extensions to the basic
waveform-relaxation method were also presented.

On  conventional  computers,  the  speed  advantage  of 
relaxation-based  analysis  over  matrix-based  techniques  can 
vary  from  a  slight  slow-down,  for  small  tightly  coupled 
analog  circuits,  to  a  maximum  of  about  two  orders  of 
magnitude  speedup,  for  large  semistatic  digital  circuits. 

B.  Special-Purpose  Hardware 

The use of special-purpose computer instructions for
sparse-matrix solution [14] and the use of vector
computers, such as the CRAY1 [17], can improve the speed of
conventional circuit simulators by about an order of
magnitude over their nonoptimized versions on the same
machine. In the latter case, the speedup is limited by the
gather/scatter problem [46] associated with arranging the
data so that effective parallel computation can be
performed. Unfortunately, the irregular structure of a circuit
sparse matrix is the limiting factor here.

If relaxation techniques are used to replace these direct
methods, the solution of each node equation is effectively
decoupled from the others. While such decoupled-analysis
techniques would be suitable for use on a vector computer,
it seems that other architectures, in particular data-flow
computers and related dependency-driven approaches [48],
[49], [59]-[62], will allow the decoupling to be exploited
more effectively. A straightforward approach to the
implementation of an electrical relaxation simulator on such a
computer would be to allocate a separate processor for the
solution of each decoupled-node equation. For an ITA
algorithm, the calculation performed by each processor
would be a single Newton-Raphson step on a nonlinear
algebraic equation in one unknown. In the case of WR,
each calculation would involve the computation of a partial
waveform, or set of waveforms, on the processor. While the
performance of a practical multiprocessor depends on many
factors, a simplified analysis is presented here to illustrate
the potential savings of such a machine.

For a circuit containing $N$ nodes with $M$ nodes actively
changing at any time, $M \le N$, the total time spent solving
the independent node equations on a serial computer is
approximately

    $T(M) \propto M t_s$    (48)

where $t_s$ is the time required to solve the single-node
equation in either scheme. Consider a multiprocessor using
a single- or multiple-stage shuffle network [61], [62] with an
element cycle time of $t_c$ and a latency proportional to
$k \log(P)$, where $P$ is the number of ports in the shuffle
network and $k$ is a constant. For now assume $P \ge M$; then
the total analysis time on such a network is approximately

    $T_p(P) \propto t_s + t_c k \log(P)$.    (49)

Equation (49) is in fact a worst-case figure because it
assumes all communication is on the critical path of the
computation and that there is no pipelining of requests.
The speed-up factor for the parallel computation is then

    $\frac{M t_s}{t_s + t_c k \log(P)}$.    (50)

If $k$ is from 1 to 3 [63], if $M = P$, and if $t_s \approx t_c$, the speed-up
becomes approximately $M / \log M$. However, if the
equation solution time is larger than the network cycle time, or
if a better than random placement of the circuit nodes on
the network can be performed, then the speed-up factor
will be closer to $M$. Note that it will generally be true in
practice that $M > P$. In that case, more than one node will
be allocated to each processor. Techniques for the
implementation of "virtual processors" can be used to solve this
problem [64], and scheduling algorithms can be used to
allocate the nodes to processors in such a way that network
loading is uniform.
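
The behaviour of the speed-up expression (50) can be explored with a
short calculation. The base of the logarithm is not stated in the text;
base 2 is assumed here because a shuffle network with P ports is usually
built from about log2(P) stages. The function name and the sample values
are illustrative only.

```python
import math


def speedup(m_active, ports, t_solve, t_cycle, k=1.0):
    """Speed-up factor (50): M*ts / (ts + tc*k*log(P)), log taken base 2."""
    return m_active * t_solve / (t_solve + t_cycle * k * math.log2(ports))


# M = P = 1024 with equal solve and cycle times: 1024 / (1 + 10), close to
# the M / log M estimate of 102.4.
balanced = speedup(1024, 1024, t_solve=1.0, t_cycle=1.0)

# Solve time 100 times the network cycle time: the factor moves toward M.
compute_bound = speedup(1024, 1024, t_solve=100.0, t_cycle=1.0)
```

The second case corresponds to the remark that the speed-up factor
approaches M when the equation solution time dominates the network
cycle time.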

It is clear that relaxation-based algorithms for electrical
simulation are well suited to the use of special-purpose
hardware. Future work in this area includes the
investigation of the best match of hardware and algorithms, and the
investigation of optimal techniques for simulation time
advancement in a parallel computational environment.

Abstract

SPLICE1 is a mixed-mode simulation program for large-scale integrated
circuits. It performs concurrent electrical and logic simulation using
event-driven selective-trace techniques. The electrical analysis uses a
new algorithm, called Iterated Timing Analysis (ITA), which performs
accurate electrical waveform analysis much faster than SPICE2 for large
circuits. The logic analysis features a new MOS-oriented state model and
a fanout dependent delay model, and handles bidirectional transfer gates
in a consistent manner.

This report describes the new algorithms and the details of the
implementation in SPLICE1.7. Program performance characteristics and a
number of simulation results are also included.

CHAPTER 1


1. INTRODUCTION


SPLICE1 is a mixed-mode simulation program for large digital MOS
integrated circuits (IC). It performs time-domain transient analysis, which
tends to be the most time-consuming and memory-intensive task in
simulation today. The enhancements made to the program are described in this
report. The starting point for this work was SPLICE1.3 [1]. This early
version of SPLICE1 included 4-state logic simulation, simple timing analysis and
a SPICE-like circuit simulation capability [1, 2].

While this version provided a degree of functionality, it suffered from
modeling and accuracy problems intrinsic to the algorithms used in the
program. Specifically, the 4-state logic model was not sufficient to perform an
accurate true-value logic simulation of general MOS circuits containing
transfer gates and wired connections (i.e., more than one gate controlling
the state of a node). The simple timing analysis algorithm had inherent
accuracy limitations and stability problems and had difficulty analyzing
circuits containing floating elements and tight feedback loops. These, and
other issues, are examined in detail elsewhere [3], and will be elaborated
further in later sections.

The  latest  version,  SPLICE1.7,  overcomes  these  problems  by  using 
state-of-the-art  algorithms  in  place  of  previous  ones. 

The electrical analysis is performed using a new technique called
Iterated Timing Analysis (ITA), which can be derived from simple timing
analysis [4,5]. In this approach, the set of nonlinear circuit equations is
solved using a relaxation-based method rather than a method which requires
the direct solution of a set of linear equations, usually found in standard
circuit simulators such as SPICE2 [6]. ITA is as accurate as SPICE2,
assuming identical device models, and has guaranteed convergence and stability
properties. Due to the selective trace feature in SPLICE1, the execution time
can be up to two orders of magnitude faster than SPICE2, with comparable
waveform accuracy, for large circuits. Another key feature of ITA is its
ability to perform accurate analysis of complex analog circuits, as will be shown
later. Iterated Timing Analysis has shown so much promise that efforts are
being directed to generalize it as a standard technique for accurate
electrical simulation. Therefore, a matrix-oriented simulation capability is no
longer available in SPLICE1.

The logic analysis capabilities have also been extended to include the
notion of multiple strengths or impedance levels [3], as is available in most
modern MOS-oriented logic simulators [7,8,9,10]. While other simulators
usually limit the number of strengths to three, there is practically no limit in
SPLICE1.7, which allows up to $2^{16} - 1$ strengths. More than three strengths
are often required to model the interaction between transfer gates of
differing geometry [3]. Processing of the gates and nodes proceeds in a
manner similar to the electrical analysis. In fact, the logic analysis may be
thought of as a relaxation-based method in which the elements are
represented by simple logic models rather than complex analytical equations.
This concept, together with the idea of multiple impedance levels,
allows for a more consistent signal representation and signal conversion in
the mixed-mode environment. Clearly, there is a correspondence between
an electrical voltage and the logic levels. With the notion of strengths, there
is now a natural correspondence between the electrical output conductance
of an element and the logic output strength of the element.

SPLICE1 can also be used to perform switch-level simulation [9, 10] to
verify circuit functionality at the transistor level. It handles CMOS, NMOS and
PMOS circuits in both static and dynamic configurations.

Although SPLICE1 originally included a table look-up scheme to speed up
MOS model evaluation [4,5], it was subsequently dropped from the program.
Research on optimal table models and structures is continuing in an
independent effort [11] and this feature may be reinstated in a later version.
Therefore, this report does not address the issue of table-driven MOS models.

The remainder of this report is divided into four chapters. In Chap. 2,
the ITA algorithm is described in detail. The enhancements in the logic
analysis are described in Chap. 3. In Chap. 4, a number of simulation results
and program performance statistics are presented. Finally, in Chap. 5, the
general conclusions are stated with specific mention of future directions.
CHAPTER 2


2.  Iterated  Timing  Analysis 

2.1.  Introduction 

A new form of electrical analysis, called Iterated Timing Analysis (ITA),
is described in this chapter. The motivation for this work is presented using
SPLICE1.3 as an example of Non-iterated Timing Analysis (NTA). A simple
mathematical treatment of the ITA method is presented here, although a
complete mathematical analysis of relaxation-based methods, presented in a
rigorous and unified framework, may be found in reference [12]. The details
of the implementation in SPLICE1.7 are also included in this chapter.

2.2.  The  Simulation  Problem 

The general circuit analysis problem in the time domain requires the
solution of a set of first-order nonlinear Ordinary Differential Equations (ODE)
of the form:

    $C(v(t),u(t))\,\dot{v}(t) = -f(v(t),u(t))$    (2-1)

where

    $v(t)$ is the set of unknown node voltages,
    $u(t)$ is the set of inputs,
    $C(v(t),u(t))$ is the nodal capacitance matrix,
    $f(v(t),u(t))$ is the sum of the currents charging the
    capacitances at each node.

This formulation can be derived by writing Kirchhoff's Current Law (KCL) at
every node, except the ground node, in a given circuit [12]. The simulation
task is to determine the unknown voltages, $v(t)$, for every node at every
timepoint due to some input excitation, $u(t)$.

The technique used in SPICE2 to solve Eqn. (2-1) is to first convert the
set of differential equations into a set of algebraic difference equations using
a stiffly-stable integration formula [8]. The nonlinear difference equations
are then converted to a set of linear equations of the form:

    $G V = I$    (2-2)

using a damped Newton-Raphson linearization process. G is the Jacobian
matrix (or the small-signal conductance matrix), V is the unknown voltage
vector and I is the known excitation vector. Next, Eqn. (2-2) is solved using a
direct matrix approach to produce the solution vector, V. Since, in general,
$f(v(t),u(t))$ is a nonlinear function, this process must be repeated until V
converges to a consistent solution.
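
The direct approach just described can be sketched on a very small example. The following Python fragment is only an illustration under stated assumptions: the two-node circuit, its element values (C, R1, R2, K) and the cubic load are hypothetical and are not taken from SPICE2, and the dense solver stands in for the sparse matrix code a real simulator would use. It applies Backward-Euler to Eqn. (2-1), linearizes with Newton-Raphson, and solves the resulting linear system of the form of Eqn. (2-2) directly at every iteration.

```python
def solve_linear(a, b):
    """Gaussian elimination with partial pivoting on a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

C, R1, R2, K = 1.0, 1.0, 1.0, 0.5   # hypothetical element values

def currents(v, u):
    """f(v,u): current leaving each node through the non-capacitive elements."""
    v1, v2 = v
    return [(v1 - u) / R1 + (v1 - v2) / R2,
            (v2 - v1) / R2 + K * v2 ** 3]

def jacobian(v, h):
    """Small-signal conductance matrix G of the difference equations."""
    v2 = v[1]
    return [[C / h + 1 / R1 + 1 / R2, -1 / R2],
            [-1 / R2, C / h + 1 / R2 + 3 * K * v2 ** 2]]

def backward_euler_step(v_prev, u, h, tol=1e-12, max_iter=50):
    """Advance one timestep: Newton-Raphson with a direct solve per iteration."""
    v = v_prev[:]
    for _ in range(max_iter):
        f = currents(v, u)
        residual = [C / h * (v[i] - v_prev[i]) + f[i] for i in range(2)]
        dv = solve_linear(jacobian(v, h), [-r for r in residual])
        v = [v[i] + dv[i] for i in range(2)]
        if max(abs(d) for d in dv) < tol:
            return v
    raise RuntimeError("Newton iteration did not converge")

v = [0.0, 0.0]
for _ in range(300):                 # transient response to a unit step input
    v = backward_euler_step(v, u=1.0, h=0.1)
print(v)
```

Every Newton iteration in this sketch forms and solves the full system, which is the cost that grows with circuit size.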

2.3.  Motivation  for  a  New  Simulation  Approach 

General-purpose simulation programs, such as SPICE2 [6] and ASTAP
[13], have been used extensively to perform accurate circuit analysis for
over 10 years. These simulators use direct methods (using sparse matrix
techniques) to solve the set of circuit equations. Unfortunately, this
approach becomes increasingly expensive as the circuit size increases. The
fundamental problem is illustrated in Fig. 2.1. The time required to formulate
the set of linear equations grows linearly with circuit size, whereas the
time required to solve the linear equations is proportional to $N^k$, where N is
the number of circuit nodes and k ranges from 1.1 to 1.5. These two solution
phases are referred to in Fig. 2.1 as FORM and SOLVE respectively. The
SOLVE phase quickly dominates the total time as the circuit size increases
and this is one reason why the direct approach is not appropriate for large
circuits.

Timing simulation was introduced in the mid-seventies to reduce CPU-time
at the expense of some accuracy. A new breed of simulators emerged
at that time, all tailored to perform transient analysis of large digital circuits
[5, 4, 1, 14, 15]. These classical timing analysis programs used iterative
techniques [16] to solve the set of circuit equations rather than the direct matrix
solution approach of the previous generation. A grounded capacitor was
required at every node to guarantee convergence of the method. However,
to reduce execution time, none of these programs carried the iteration to
convergence; in fact, each node equation was solved only once at each
timepoint.

Large  digital  circuits  typically  display  a  10-20  %  latency  characteristic. 
That  is,  only  10-20  %  of  the  nodes  in  the  circuit  are  active  at  any  given  time. 
Conceptually,  there  is  temporal  sparsity  (latency  in  a  waveform  over  a  time 
period)  and  spatial  sparsity  (latency  in  the  network  at  a  given  point  in  time) 
[17].  Since  these  iteration  methods  involve  the  solution  of  each  equation 
separately,  this  latency  aspect  can  be  exploited  to  further  improve  perfor¬ 
mance. 

Using these techniques, two orders of magnitude of speed improvement
were obtained. Accuracy was maintained in these simulators by choosing a
small fixed timestep for the entire analysis. This timestep was either constant
for all circuits [5] or chosen based on the smallest time constant in the
circuit [4]. Later timing simulators adjusted the timestep dynamically during
the analysis to limit the voltage change over a timestep to a value
specified by the user [1]. Simple timing analysis also relied heavily on the
fact that the accumulated voltage error becomes zero once a node reaches
either of the two supply voltages.

Fig. 2.1: The amount of CPU time required to perform a transient analysis
of a set of typical circuits of increasing size.

While offering a substantial savings in CPU-time and memory usage,
these programs suffer from a number of problems which have limited their
use. Circuits containing global feedback loops, such as the ring oscillator of
Fig. 2.2(a), produce timing and voltage errors in the simulation. Fig. 2.2(b)
illustrates these errors as generated by SPLICE1.3 compared to the solution
produced by SPICE2. The SPLICE1.3 program not only produces a timing
error (phase error) but incorrectly predicts the number of cycles in a given
time period (frequency error) and the height of the peaks (amplitude error).

These errors are all due to the single iteration performed at each
timepoint. To understand the origin of the errors, consider the processing
sequence of a simple NMOS inverter of Fig. 2.3(a) using NTA.

(1) The first step is to represent each nonlinear device by its corresponding
linear companion model. This is done in Fig. 2.3(b). These equivalent
models are based on the terminal voltages of each device. The conductance
is obtained from the slope of the nonlinear I-V characteristic at
the operating point and the current is obtained from the y-axis intercept,
as shown in Fig. 2.3(c). The value of the voltage at Node C is calculated
using this equivalent circuit of Fig. 2.3(b). Assume that initially
$V_B^{n-1} = 0$ V and $V_C^{n-1} = 5.0$ V, where $V_B^{n-1}$ refers to the voltage at node B
at time $t_{n-1}$ and $V_C^{n-1}$ has a similar definition.

(2) Let $V_B^n = 1.0$ V. Then the change in the voltage at node C is calculated
using $V_B^n$ and $V_C^{n-1}$. Therefore, the linear equivalent model of the load
is the same as it was at $t_{n-1}$ but the equivalent model of the driver
changes. Since the load offers less charging current than it really
should (i.e., $V_C$ is incorrect) and the driver is able to sink more current
due to a larger $V_B$, the node voltage change is too optimistic.

(3) When $V_B^{n+1} = 2.0$ V, again $V_C^{n+1} = f(V_C^n, V_B^{n+1})$ and $\Delta V_C^{n+1}$ is also
optimistic by the same argument given above.

Hence, in a SPLICE1.3 simulation, the output of the inverter will rise and
fall earlier in time with faster rise and fall times than the SPICE2 simulation
of the same circuit. This error is propagated and intensified in the ring oscillator
circuit, resulting in the three errors cited above. It should be noted
that if the timestep of the simulation is reduced and the accuracy tolerances
are tight, the NTA output will be indistinguishable from SPICE2 output for
this example.

Another shortcoming of NTA is that it has some difficulty dealing with
circuits containing floating elements, such as capacitors and transfer gates.
These elements introduce strong bilateral coupling between two nodes in the
circuit. Since only one iteration is used, the solution obtained using NTA
depends on the order in which the nodes are processed. Consider, for example,
the 2-input NMOS NAND circuit of Fig. 2.4(a). It contains a "floating" transistor,
namely M2. The sequence of processing in the NTA method would be as
follows:

(1) Assume that initially $V_X^{n-1} = 0$ V, $V_Y^{n-1} = 5.0$ V and $V_W^{n-1} = 0$ V.

(2) At $t_n$, $V_W^n = 1.0$ V and Node X is processed using the initial conditions
$V_X^{n-1}$ and $V_Y^{n-1}$ given above to produce $V_X^n$.

(3) Next, Node Y is processed using $V_Y^{n-1}$ and $V_X^n$ to generate $V_Y^n$.

(4) Then, time is advanced by one unit and steps (2) to (4) are repeated.
This process continues until Node Y makes its transition to the opposite
rail voltage.


There  are  two  problems  with  this  method: 

•  the change in node Y should immediately affect node X but it is not
reflected at node X until the next time point.

•  if the processing started with node Y instead of node X, slightly different
results would be obtained.
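
The order dependence can be reproduced with a small numerical sketch. The fragment below is an illustration only: the two-node linear network, its conductances (g_in, g_xy), the capacitance c and the timestep are hypothetical stand-ins for the NAND circuit, chosen so that the arithmetic is easy to follow. One relaxation sweep per timepoint (the NTA behaviour) is compared with sweeping to convergence.

```python
def sweep(v, order, v_prev, u, h, c=1.0, g_in=1.0, g_xy=5.0):
    """One relaxation sweep over the backward-Euler node equations.

    Node X is tied to the input u through g_in; nodes X and Y are coupled
    by the floating conductance g_xy; both have a grounded capacitor c.
    """
    v = dict(v)
    for node in order:
        if node == "X":
            v["X"] = (c / h * v_prev["X"] + g_in * u + g_xy * v["Y"]) / (c / h + g_in + g_xy)
        else:
            v["Y"] = (c / h * v_prev["Y"] + g_xy * v["X"]) / (c / h + g_xy)
    return v

def solve_timepoint(order, v_prev, u, h, iterate, tol=1e-12):
    """Solve one timepoint with a single sweep (NTA) or to convergence."""
    v = dict(v_prev)
    while True:
        new = sweep(v, order, v_prev, u, h)
        done = max(abs(new[k] - v[k]) for k in v) < tol
        v = new
        if done or not iterate:
            return v

start = {"X": 0.0, "Y": 0.0}
nta_xy = solve_timepoint(("X", "Y"), start, u=1.0, h=0.5, iterate=False)
nta_yx = solve_timepoint(("Y", "X"), start, u=1.0, h=0.5, iterate=False)
ita_xy = solve_timepoint(("X", "Y"), start, u=1.0, h=0.5, iterate=True)
ita_yx = solve_timepoint(("Y", "X"), start, u=1.0, h=0.5, iterate=True)
print(nta_xy, nta_yx)   # single sweep: the result depends on the node order
print(ita_xy, ita_yx)   # converged: both orders give the same solution
```

With a single sweep, node Y either sees the updated value of node X or the stale one, depending on the order; iterating to convergence removes the difference.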

The same effect is observed when processing capacitors where one node
is not connected to ground (i.e., a floating capacitor). An example of a circuit
with such a capacitor is the boot-strapped inverter of Fig. 2.4(b). The
accuracy of NTA depends on the timestep and the ratio of the floating capacitor
to the grounded capacitor. In the boot-strapped inverter, the value of
C03 is usually large compared to the grounded capacitors, C01 and C02, and
this tends to reduce the accuracy of the solution produced by NTA.

Therefore, NTA will produce somewhat inaccurate results when there are
floating elements in the circuit. Reducing the timestep to an appropriate
value will improve the accuracy, but if the timestep is not small enough,
these elements may cause the simulator to exhibit instability. As will be
seen later in this section, the ITA approach overcomes all of these problems.

By far the most compromising aspect of the NTA approach is that it may
occasionally produce the wrong answer! Circuit designers are willing to use a
program which gives them the correct answer or no answer (usually due to
non-convergence), but are unable to deal with a program that occasionally
produces the wrong answer. In fact, the NTA method always produces some
answer and this is really the downfall of the method.

For the reasons given above, timing analysis has not been widely
accepted as a viable form of electrical simulation, although it has been used
successfully in constrained IC design methodologies such as standard cell or
gate array. What is really required is a simulation technique which provides
both accuracy and speed.

Fig. 2.4(a): 2-input NMOS NAND circuit. The floating transistor
connected between nodes X, Y, and W causes problems for NTA.

2.4.  Relaxation-based  Electrical  Simulation 

A number of new techniques have been developed in an effort to reduce
the simulation time while maintaining waveform accuracy comparable to
SPICE2. These include table-driven model evaluation [11], microcode tailoring
on a minicomputer [18] and the use of vector-oriented computers such
as the CRAY-1 [19]. Although these techniques have been successful, they
provide, at most, an order of magnitude speed improvement over SPICE2.

Two methods are currently being investigated which use a converged
relaxation iteration to solve the set of circuit equations. Both approaches
have been implemented and preliminary results indicate that up to two orders
of magnitude of speed improvement may be obtained for large digital circuits.
One method, called Waveform Relaxation [20], decomposes the system
of equations into several dynamic subsystems, each of which is analyzed for
the entire simulation period. The process is then repeated until all the
waveforms converge to an exact solution. The relaxation is performed at the
differential equation level. This method has been implemented in the program
RELAX [20,21].

The second method is called Iterated Timing Analysis (ITA) [22, 23]. In
this method, the relaxation is performed at the nonlinear equation level.
That is, the set of nonlinear circuit equations is iterated to convergence
using a Gauss-Seidel or Gauss-Jacobi method. This is also an exact method.
Some aspects of this method which make it attractive are as follows:


•  it  has  guaranteed  convergence  and  stability  properties 

•  it  allows  circuit  latency  to  be  exploited  easily 

•  it  can  be  implemented  using  the  concepts  developed  for  logic  simulation 

•  since  the  logic  and  electrical  analyses  operate  the  same  way,  a  con¬ 
sistent  mixed-mode  simulation  is  possible 

The  algorithm  has  been  implemented  in  SPLICE1.7  and  the  implementa¬ 
tion  details  and  results  obtained  are  presented  in  this  chapter  following  a 
simple  mathematical  treatment  of  the  method. 

2.5.  The  ITA  Algorithm 

2.5.1.  The  Gauss-Seidel  Iteration  Method 

A  system  of  simultaneous  linear  equations  can  be  solved  using  a  variety 
of  techniques,  namely: 

1.  Direct  Methods 

a.  Matrix  Inversion 

b.  Gaussian  Elimination 

c.  LU  decomposition 

2.  Iterative  Methods 

a.  Gauss-Jacobi 

b.  Gauss-Seidel 

In circuit simulation, the solution to Eqn. (2-2) is required. The circuit
conductance matrix, G, is usually large but sparse, typically having 3 elements
per row. Matrix inversion is not a suitable method because it usually converts
a sparse matrix into a dense one. Sparse matrix techniques can be
used to solve the equations using method 1(b) or 1(c) but this is not suitable
for large circuits due to the rapid increase in CPU-time, as shown in Fig. 2.1.


The iterative methods [16] are well-suited to cases where the matrix is
sparse. In fact, the solution of a set of sparse linear equations may be
obtained faster using an iterative approach. Two classical iteration methods
exist: the Gauss-Jacobi (G-J) method and the Gauss-Seidel (G-S) method. The
Gauss-Jacobi method (also referred to in the literature as simultaneous
displacement) proposes the following approach:

    $v^0$ = initial guess voltage vector
    m <- 0
    repeat {
        for (i = 1 to N) {
            $v_i^{m+1} = \frac{1}{g_{ii}} \left[ I_i - \sum_{j \ne i} g_{ij} v_j^m \right]$    (2-3)
        }
        m <- m + 1
    } until $|v_i^{m+1} - v_i^m| \le \epsilon$ for all i, i.e., convergence

Here $g_{ij}$ denotes the entries of G and $I_i$ the entries of I in Eqn. (2-2).
Notice that every equation uses the previous iteration values for all unknown
voltages to obtain a new solution vector. The Gauss-Seidel method (also
referred to as successive displacement) suggests the following modification
to Gauss-Jacobi:

    $v^0$ = initial guess voltage vector
    m <- 0
    repeat {
        for (i = 1 to N) {
            $v_i^{m+1} = \frac{1}{g_{ii}} \left[ I_i - \sum_{j < i} g_{ij} v_j^{m+1} - \sum_{j > i} g_{ij} v_j^m \right]$    (2-4)
        }
        m <- m + 1
    } until $|v_i^{m+1} - v_i^m| \le \epsilon$ for all i, i.e., convergence

Notice that each equation uses the latest values of voltage wherever possible.

The only difference in the two methods is whether the previous voltages
are always used or the latest values are applied immediately. The convergence
rate is linear in both cases but the speed of convergence is quite
different. Usually the Gauss-Seidel iteration converges faster than Gauss-Jacobi
[16], although there are cases where this is not true.
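
The two iterations can be compared on a small system. The Python sketch below is illustrative only: the 3x3 conductance matrix and excitation vector are made-up values (chosen to be strictly diagonally dominant, with solution [1, 2, 3]), and the function names are not taken from any simulator.

```python
def gauss_jacobi(g, i_vec, tol=1e-10, max_iter=10_000):
    """Simultaneous displacement: every update uses the previous iterate."""
    n = len(i_vec)
    v = [0.0] * n
    for m in range(1, max_iter + 1):
        new = [(i_vec[i] - sum(g[i][j] * v[j] for j in range(n) if j != i)) / g[i][i]
               for i in range(n)]
        delta = max(abs(a - b) for a, b in zip(new, v))
        v = new
        if delta <= tol:
            return v, m
    raise RuntimeError("Gauss-Jacobi did not converge")

def gauss_seidel(g, i_vec, tol=1e-10, max_iter=10_000):
    """Successive displacement: every update uses the latest available values."""
    n = len(i_vec)
    v = [0.0] * n
    for m in range(1, max_iter + 1):
        delta = 0.0
        for i in range(n):
            new_i = (i_vec[i] - sum(g[i][j] * v[j] for j in range(n) if j != i)) / g[i][i]
            delta = max(delta, abs(new_i - v[i]))
            v[i] = new_i                      # applied immediately
        if delta <= tol:
            return v, m
    raise RuntimeError("Gauss-Seidel did not converge")

G = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
I = [2.0, 4.0, 10.0]

v_gj, sweeps_gj = gauss_jacobi(G, I)
v_gs, sweeps_gs = gauss_seidel(G, I)
print(v_gj, sweeps_gj)
print(v_gs, sweeps_gs)
```

For this example both methods reach the same solution, and Gauss-Seidel needs fewer sweeps.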

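Both methods also require the strict diagonal dominance condition for
guaranteed convergence:

    $|g_{ii}| > \sum_{j \ne i} |g_{ij}|$ for all i    (2-5)

This inequality states that the magnitude of each diagonal term $g_{ii}$ of the
matrix G must be greater than the sum of the magnitudes of all the
off-diagonal terms $g_{ij}$ in the same row.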

An acceleration scheme is available to speed up convergence using an
acceleration parameter, $\omega$, as follows:

    $v_i^{m+1} = v_i^m + \omega \left( \tilde{v}_i^{m+1} - v_i^m \right)$    (2-6)

where $\tilde{v}_i^{m+1}$ is an intermediate value generated using Eqn. (2-4). The effect of $\omega$
is usually dramatic, but a good value can only be obtained empirically and usually varies
from technology to technology. For standard Gauss-Seidel, $\omega = 1$.
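
The effect of the acceleration parameter can be seen in a small experiment. The following sketch is an assumption-laden illustration: the 10-node tridiagonal matrix (diagonal 2.1, off-diagonals -1) and the value 1.4 for the acceleration parameter are hypothetical choices for this example, not values recommended by the report.

```python
def sor(g, i_vec, omega, tol=1e-10, max_iter=10_000):
    """Gauss-Seidel with acceleration; omega = 1.0 gives standard Gauss-Seidel."""
    n = len(i_vec)
    v = [0.0] * n
    for m in range(1, max_iter + 1):
        delta = 0.0
        for i in range(n):
            # intermediate Gauss-Seidel value for unknown i
            gs_value = (i_vec[i] - sum(g[i][j] * v[j] for j in range(n) if j != i)) / g[i][i]
            new_i = v[i] + omega * (gs_value - v[i])
            delta = max(delta, abs(new_i - v[i]))
            v[i] = new_i
        if delta <= tol:
            return v, m
    raise RuntimeError("iteration did not converge")

n = 10
G = [[2.1 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
I = [1.0] * n

v_plain, sweeps_plain = sor(G, I, omega=1.0)
v_accel, sweeps_accel = sor(G, I, omega=1.4)
print(sweeps_plain, sweeps_accel)
```

On this weakly dominant system the accelerated iteration needs noticeably fewer sweeps for the same tolerance, which matches the remark that the best value has to be found by experiment.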

2.5.2.  A  Nonlinear  Gauss-Seidel  Iterative  Approach 

Relaxation methods, as described in the previous section, can also be
applied successfully at the nonlinear equation level. The same approach is
used as for linear equations except that each nonlinear equation must first
be linearized and solved before proceeding to the next equation. Using this
approach, the time-consuming effort required to calculate the Jacobian
matrix entries can be avoided.


The steps at the nonlinear equation level are as follows. Starting with
equation (2-1), the first step is to convert the differential equations into
difference equations using a stiffly-stable integration formula. SPLICE1.7
uses a Backward-Euler formulation [24]. Then the first equation is linearized
using the Newton-Raphson (N-R) method and iterated to convergence to solve
for one unknown voltage. This constitutes the inner N-R loop. The same process
is applied to the next equation and all subsequent equations, in turn,
until the last equation is processed. This outer G-S loop is now iterated to
convergence to produce the solution.

To further illustrate the method, consider the solution method applied
to one node in a typical circuit. Fig. 2.5(a) shows three nonlinear devices
connected to Node 4, which has a capacitor connected to ground. We begin by
writing KCL for Node 4:

    $\sum_{j=1}^{4} I_j = I_1 + I_2 + I_3 + I_4 = 0$    (2-7)

This can be rewritten in the form of Eqn. (2-1):

    $C_4 \dot{V}_4 = -\left( I_1(V_1,V_4) + I_2(V_2,V_4) + I_3(V_3,V_4) \right)$    (2-8)

Using the Backward-Euler formula for $I_4$, we obtain

    $I_4 = C_4 \dot{V}_4 = \frac{C_4}{h} \left( V_4(n) - V_4(n-1) \right)$    (2-9)

where h is the integration step size, $V_4(n)$ refers to the voltage value for Node
4 at time $t_n$ and $V_4(n-1)$ refers to the solution obtained for Node 4 at time
$t_{n-1}$.

Therefore, Eqn. (2-8) can now be written as a difference equation:

    $\frac{C_4}{h} \left( V_4(n) - V_4(n-1) \right) + I_1(V_1,V_4) + I_2(V_2,V_4) + I_3(V_3,V_4) = 0$    (2-10)

Since Eqn. (2-10) has the form

    $f(V_1,V_2,V_3,V_4) = 0$    (2-11)

it is suitable for the Newton-Raphson (N-R) iterative method with $V_4$ as the
unknown variable. The general equation for one N-R iteration is

    $x^{i+1} = x^i - \frac{f(x^i)}{f'(x^i)}$    (2-12)

In circuit terms, the N-R calculation usually requires that a linear equivalent
be determined for each nonlinear device connected to the node, as shown in
Fig. 2.5(b) for D1. This involves the calculation of a conductance, $G_0$, and a
current intercept, $I_0$. In order to avoid the intercept calculation, we can
apply Eqn. (2-11) directly to Eqn. (2-12) to get

    $V_4(n)^{i+1} = V_4(n)^i - \frac{f(V_1,V_2,V_3,V_4)}{f'(V_1,V_2,V_3,V_4)}$    (2-13)

Now set $\Delta V_4(n)^{i+1} = V_4(n)^{i+1} - V_4(n)^i$ and substitute Eqn. (2-10) into Eqn. (2-13) to
obtain

    $\Delta V_4(n)^{i+1} = -\frac{\frac{C_4}{h}\left( V_4(n)^i - V_4(n-1) \right) + \sum_{j=1}^{3} I_j^i}{\frac{C_4}{h} + \sum_{j=1}^{3} \frac{\partial I_j^i}{\partial V_4}}$    (2-14)

where $V_4(n)^i$ refers to the ith iteration value of voltage at Node 4 at time $t_n$,
$I_j^i$ refers to the ith iteration value of the current $I_j$, and $f'$ is the derivative
of f with respect to $V_4$. This method of evaluating $\Delta V$ is convenient because:

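•  no intercepts need to be calculated since total currents are used in Eqn.
(2-14)

•  current levels are within operating ranges (unlike $I_0$ in Fig. 2.5(b))

•  the value of $\Delta V$ is very accurate when calculated this way. Note that $\Delta V$
is the difference between two Newton iterations and it will tend toward
zero with each iteration. Therefore it should be calculated as accurately
as possible.

For an arbitrary node k, Eqn. (2-14) becomes

    $\Delta V_k(n)^{i+1} = -\frac{\frac{C_k}{h}\left( V_k(n)^i - V_k(n-1) \right) + \sum_j I_j^i}{\frac{C_k}{h} + \sum_j \frac{\partial I_j^i}{\partial V_k}}$    (2-15)

where the sums run over the devices connected to node k and $C_k$ is the
grounded capacitance at that node.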
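
A single-node version of this update can be written in a few lines. The Python sketch below is a minimal illustration, not SPLICE code: the device models (a resistor to a 5 V source and a square-law load), their parameter values and all function names are assumptions made for the example. Each device returns its total current leaving the node and the derivative of that current with respect to the node voltage, so no intercept is ever computed.

```python
def resistor_to(v_other, r):
    """Hypothetical device: resistor from the node to a fixed voltage v_other."""
    return lambda v: ((v - v_other) / r, 1.0 / r)

def square_law_load(k):
    """Hypothetical nonlinear device to ground with I = k * v * |v|."""
    return lambda v: (k * v * abs(v), 2.0 * k * abs(v))

def delta_v(v_iter, v_prev, h, c, devices):
    """Voltage change for one Newton iteration at one node, from total currents."""
    i_total = 0.0
    g_total = 0.0
    for device in devices:
        current, conductance = device(v_iter)
        i_total += current
        g_total += conductance
    return -(c / h * (v_iter - v_prev) + i_total) / (c / h + g_total)

def solve_node(v_prev, h, c, devices, tol=1e-12, max_iter=50):
    """Iterate the update to convergence, starting from the previous timepoint."""
    v = v_prev
    for _ in range(max_iter):
        dv = delta_v(v, v_prev, h, c, devices)
        v += dv
        if abs(dv) < tol:
            return v
    raise RuntimeError("Newton iteration did not converge")

devices = [resistor_to(5.0, 1.0), square_law_load(0.5)]
v_new = solve_node(v_prev=0.0, h=0.1, c=1.0, devices=devices)
print(v_new)
```

For these values the node equation reduces to 0.5 v^2 + 11 v - 5 = 0, so the result can be checked against the closed-form root.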

2.5.3.  The  SOR-Newton  Iteration 

A combination of the Newton-Raphson iteration in a converged Gauss-Seidel
loop with acceleration applied is called the SOR-Newton method. In
equation form, it is simply

    $V_k(n)^{i+1} = V_k(n)^i + \omega\,\Delta V_k(n)^{i+1}$

where $\Delta V_k(n)^{i+1}$ is the Newton-Raphson voltage change of Eqn. (2-15) and
$\omega$ is the acceleration parameter.

In a standard N-R iteration, the equation is iterated until $|\Delta V| \le \epsilon$. This
means that each node equation should be iterated to convergence before
moving on to the next one. The Gauss-Seidel loop (i.e., the outer loop) must
also be iterated to convergence.

2.5.4.  Convergence  of  the  SOR-Newton  Iteration 

A  very  important  property  of  the  SOR-Newton  iteration  can  be  applied 
now  to  greatly  reduce  the  number  of  iterations  of  the  inner  N-R  loop.  It  hap¬ 
pens  that  one  Newton  iteration  per  equation  for  each  G-S  iteration  is 
sufficient  to  retain  the  convergence  properties  of  the  nonlinear  Gauss-Seidel 
iteration  [16]  as  long  as  the  convergence  requirements  of  the  N-R  iteration 
are  strictly  satisfied. 
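
This property can be checked numerically on a small example. The sketch below is only an illustration under assumed values: the two-node circuit, its element values (C, R1, R2, K) and the cubic load are hypothetical. The same timepoint is solved once with a single Newton step per node per sweep and once with many inner Newton steps, and the two solutions are compared.

```python
C, R1, R2, K = 1.0, 1.0, 1.0, 0.5   # hypothetical element values

def node_equation(k, v, v_prev, u, h):
    """Return (f_k, df_k/dv_k) for the backward-Euler KCL equation at node k."""
    if k == 0:
        f = C / h * (v[0] - v_prev[0]) + (v[0] - u) / R1 + (v[0] - v[1]) / R2
        return f, C / h + 1 / R1 + 1 / R2
    f = C / h * (v[1] - v_prev[1]) + (v[1] - v[0]) / R2 + K * v[1] ** 3
    return f, C / h + 1 / R2 + 3 * K * v[1] ** 2

def sor_newton(v_prev, u, h, newton_steps=1, omega=1.0, tol=1e-12, max_sweeps=1000):
    """Outer Gauss-Seidel sweeps with a fixed number of inner Newton steps per node."""
    v = list(v_prev)
    for sweep in range(1, max_sweeps + 1):
        largest = 0.0
        for k in range(len(v)):
            start = v[k]
            for _ in range(newton_steps):
                f, dfdv = node_equation(k, v, v_prev, u, h)
                v[k] += omega * (-f / dfdv)
            largest = max(largest, abs(v[k] - start))
        if largest < tol:
            return v, sweep
    raise RuntimeError("SOR-Newton did not converge")

v_one, sweeps_one = sor_newton([0.0, 0.0], u=1.0, h=0.1, newton_steps=1)
v_many, sweeps_many = sor_newton([0.0, 0.0], u=1.0, h=0.1, newton_steps=20)
print(v_one, sweeps_one)
print(v_many, sweeps_many)
```

Both runs arrive at the same node voltages; the single-step variant simply avoids the repeated inner model evaluations.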

A Newton-Raphson iteration will converge if the initial guess is "close
enough" to the exact solution, given that the function is Lipschitz
continuous. Under these conditions, the rate of convergence is quadratic.
Since the element model equations are smooth, the solution from one
timepoint to the next will not be drastically different. Therefore, the solution
at the previous timepoint is a good first guess for the N-R iteration. Furthermore,
a prediction step may be used to generate a better first guess. A simple
linear predictor, based on the previous two solution points, is used in SPLICE1.
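
A linear predictor of this kind amounts to extrapolating the straight line through the last two solution points. The following sketch shows one plausible form; the function name and argument order are assumptions made for illustration and are not taken from the SPLICE1 source.

```python
def predict_linear(t_older, v_older, t_last, v_last, t_new):
    """Extrapolate the line through the last two solution points to t_new."""
    slope = (v_last - v_older) / (t_last - t_older)
    return v_last + slope * (t_new - t_last)

# A node that rose from 1.0 V to 1.5 V over the last step is predicted
# to continue at the same rate over the next equal step.
guess = predict_linear(0.0, 1.0, 1.0, 1.5, 2.0)
print(guess)
```

The predicted value is then used as the starting point of the N-R iteration in place of the previous solution.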

The diagonal dominance requirement for the G-S iteration must also be
satisfied to guarantee the convergence of the SOR-Newton iteration. In circuit
terms, this requirement can always be met by placing a grounded capacitor
at every node and choosing an appropriate timestep. Grounded capacitors
appear as $C/h$ terms in the diagonal position of the conductance matrix
G. Therefore, h, which is the simulation timestep, can be reduced until the
$C/h$ term is greater than the sum of all off-diagonal terms.

Off-diagonal terms appear in the conductance matrix when there is coupling
between two nodes. For example, when floating capacitors are used, $C/h$
terms appear in diagonal and off-diagonal positions. Therefore, reducing the
value of h is not as effective and this may lead to convergence problems. The
ratio of the floating capacitor to the grounded capacitor is an important factor
in determining the speed at which convergence is achieved. If the floating
capacitor is very large compared to the grounded capacitor, convergence
speed will be slow, if the iteration converges at all. The current version of
SPLICE uses the IIE method (Implicit-Implicit-Explicit) [25] to evaluate floating
capacitors.
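
The influence of a floating capacitor on diagonal dominance can be quantified with a short calculation. The sketch below uses a simplified single-row model that is an assumption of this example: one node with a grounded capacitor, one coupling conductance and one floating capacitor to the same neighbour, all with made-up values. It computes the ratio of the off-diagonal magnitude to the diagonal term, which must be below 1 for strict dominance and should be small for fast convergence.

```python
def coupling_ratio(h, c_ground, c_float, g_couple):
    """Sum of off-diagonal magnitudes divided by the diagonal term of one row."""
    off_diagonal = g_couple + c_float / h
    diagonal = c_ground / h + g_couple + c_float / h
    return off_diagonal / diagonal

# Without a floating capacitor, shrinking h drives the ratio toward zero.
for h in (1.0, 1e-2, 1e-4, 1e-6):
    print(h, coupling_ratio(h, c_ground=1.0, c_float=0.0, g_couple=1.0))

# With a floating capacitor ten times the grounded one, the ratio
# approaches c_float / (c_ground + c_float) = 10/11 however small h is.
for h in (1.0, 1e-2, 1e-4, 1e-6):
    print(h, coupling_ratio(h, c_ground=1.0, c_float=10.0, g_couple=1.0))
```

In the second case the row remains strictly dominant, but only barely, which is consistent with the slow convergence described above for large floating capacitors.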


2.6.  Exploiting  latency 

SPLICE1 does not solve every node at every timepoint. In fact, only
those nodes which are active at any given point in time are processed. Since
large circuits are relatively inactive, less than 20% of the nodes are actually
solved at each timepoint. The active nodes are determined on an event-driven
basis. That is, a node is placed in the set of active nodes if any node
which can affect it changes by a significant amount.

Once the set of active nodes is identified, SPLICE1 can exploit two
forms of latency. The first one is called simply latency in time. This is based
on the fact that digital circuit waveforms feature long constant periods. An
active node is processed at consecutive points in time until it reaches a constant
value. It is then removed from the set of active nodes and becomes
latent. The second form of latency is the so-called latency at a timepoint.
This refers to the fact that some nodes may actually converge with fewer
iterations than others at a given timepoint. These nodes can be marked to
be processed at the next timepoint while the remaining nodes continue to
iterate to convergence at the current timepoint. Tightly-coupled nodes usually
require more iterations than other nodes.

The decoupled nature of ITA allows both forms of latency to be exploited
efficiently. These techniques reduce the overall computation significantly.
Standard circuit simulators, by contrast, solve every node at every timepoint
and all nodes converge simultaneously.

2.7.  Implementation in SPLICE

The  analysis  techniques  described  in  the  previous  sections  have  been 
implemented  in  SPLICE1.7.  The  details  are  described  in  this  section  with 


special  attention  given  to  areas  where  further  optimization  would  improve 
the  simulator  performance. 


SPLICE1 has a fixed minimum timestep called the mrt (minimum resolvable
time). Events can only be scheduled at integer multiples of mrt. There
is a scheduling threshold parameter called mindvsch which is the minimum
change in a node voltage over a timestep which causes the fanout elements
of the node to be scheduled. The convergence criterion is defined by two
parameters called abstol (absolute tolerance) and reltol (relative tolerance).
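
The report does not spell out how these parameters are combined, so the following Python sketch is an assumption rather than the SPLICE1 implementation. It shows one conventional way to build a convergence test from an absolute and a relative tolerance, and one way to round an event time up to a whole number of mrt ticks. The default tolerance values and both function names are hypothetical.

```python
import math

def has_converged(delta_v, v_new, v_old, abstol=1e-4, reltol=1e-3):
    """Assumed test: |delta_v| <= abstol + reltol * max(|v_new|, |v_old|)."""
    return abs(delta_v) <= abstol + reltol * max(abs(v_new), abs(v_old))

def to_mrt_ticks(t, mrt):
    """Round an event time up to a whole number of mrt ticks.

    The small offset guards against floating-point noise when t is already
    an exact multiple of mrt.
    """
    return math.ceil(t / mrt - 1e-9)

print(has_converged(1e-5, 5.0, 5.0))     # small change at a 5 V node
print(has_converged(0.1, 5.0, 4.9))      # 100 mV change is not converged
print(to_mrt_ticks(2.5e-9, 1e-9))        # event between ticks moves to the later tick
```

Keeping event times as integer tick counts avoids accumulating rounding error in the time queue.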


2.7.1.  Program Flow


The program flow has not changed since the SPLICE1.3 release. The
details of the processing may be found in [1,3] and are not repeated here.
The data structures of the ITA as implemented in SPLICE1.7 are given in
APPENDIX II. The general program flow for electrical analysis is as follows:

    set all nodes to their initial values ;
    schedule all FOL's at time 0 ;        # FOL = FanOut List of a node
    Tn = 0 ;
    while ( Tn < TSTOP ) {
        foreach (FOL in the queue at the current timepoint) {
            foreach (element in the FOL) {
                foreach (output node of an element) {
                    process node ;        # see next section for details
                    schedule FOL if necessary ;
                }
            }
        }
        plot all requested active nodes ;
        advance Tn to the next timepoint ;
    }
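
The benefit of event-driven scheduling can be illustrated with a deliberately simplified sketch. Everything in the following Python fragment is an assumption made for the example: the circuit is a chain of identical first-order stages driven by a unit step, each active node is solved once per timepoint with Backward-Euler (there is no iteration, since the chain is unidirectional), and the fanout of a node is scheduled for the next timepoint rather than the current one. It counts how many node evaluations are performed compared with solving every node at every timepoint.

```python
def simulate_chain(n_nodes=20, n_steps=150, h=0.1, tau=0.2, mindv=1e-5):
    """Event-driven backward-Euler analysis of a chain of first-order stages."""
    v = [0.0] * n_nodes
    fanout = {k: ([k + 1] if k + 1 < n_nodes else []) for k in range(n_nodes)}
    active = {0}                 # only the node driven by the input starts active
    evaluations = 0
    for _ in range(n_steps):
        scheduled_next = set()
        for k in sorted(active):
            drive = 1.0 if k == 0 else v[k - 1]        # unit step on node 0
            new_v = (v[k] + (h / tau) * drive) / (1.0 + h / tau)
            evaluations += 1
            if abs(new_v - v[k]) > mindv:              # significant change
                scheduled_next.add(k)                  # keep processing this node
                scheduled_next.update(fanout[k])       # wake the nodes it drives
            v[k] = new_v
        active = scheduled_next                        # latent nodes are skipped
    return v, evaluations

voltages, evaluations = simulate_chain()
print(evaluations, "evaluations instead of", 20 * 150)
print(voltages[-1])
```

Nodes at the far end of the chain are not evaluated until the transition reaches them, and nodes that have settled drop out of the active set, so the evaluation count stays below the full node-times-timepoint product.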


2.7.2.  Details  of  Node  Processing 


A  subroutine  in  SPLICE1.7  processes  all  electrical  nodes,  calculates  the  new 
node  voltage,  decides  whether  the  node  has  converged  and  determines 


whether  subsequent  scheduling  is  necessary.  A  high-level  pseudo-code 
description  of  the  routine  is  as  follows: 


    begin
    # Iterated timing analysis algorithm in SPLICE1.7
    # Node processing sequence
    obtain next node m ;
    if (first time processed at new time point) {
        use last two points to perform linear prediction ;
        convflg = false ;
    }
    Gnet = Inet = 0 ;
    for (each fanin element at node m) {
        compute equivalent conductance Geq ;
        compute total current flowing into node Ieq ;
        Gnet = Gnet + Geq ;
        Inet = Inet + Ieq ;
    }
    calculate ΔV ;      # change in voltage over an iteration
    calculate Vnew ;    # new node voltage
    DV = |Vnew - Vprev| ;    # change in node voltage over one timestep
                             # (Vprev = value at the previous timepoint)
    if (ΔV < tolerance) {    # node has converged
        if (convflg == false) {    # have not converged at this timepoint before
            if (DV > mindvsched) {    # node change is significant
                schedule current fol at Tn+1 (future) ;
                schedule fol of node at Tn (now) ;
                convflg = true ;
            }
            else {    # node change is not significant over one timestep
                do nothing ;
            }
        }
        else {    # have converged previously at this timepoint
            do nothing ;    # break any feedback loops
        }
    }
    else {    # node has not converged so keep processing
        convflg = false ;
        schedule current fol at Tn (now) ;
        schedule fol of node at Tn (now) ;
    }
    # Finished this node for this iteration
    return
    end
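
The accumulation of Gnet and Inet in the routine above can be illustrated with a small Python sketch for a node whose fanin elements are linear conductances. The update ΔV = Inet/Gnet is the usual single Newton-type step for a nodal equation and is assumed here, since the pseudo-code only says "calculate ΔV"; the function name and argument layout are likewise hypothetical.

```python
def process_node(v, fanin, tolerance=1e-6):
    """One relaxation step at a node.

    v:      present voltage estimate at the node.
    fanin:  list of (g, v_other) pairs, each a linear conductance g
            to a neighbouring node held at v_other for this step.
    Returns (v_new, delta_v, converged).
    """
    g_net = 0.0
    i_net = 0.0
    for g, v_other in fanin:
        g_net += g                    # equivalent conductance Geq
        i_net += g * (v_other - v)    # current flowing into the node, Ieq
    delta_v = i_net / g_net           # assumed Newton-type update
    return v + delta_v, delta_v, abs(delta_v) < tolerance
```

For a node tied through two equal resistors to 5 V and to ground, the first call moves the node from 0 V to 2.5 V and the second call reports convergence, because the net current into the node is then zero.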


2.7.3.  Element  Models 


SPLICE1.7 has built-in models for resistors, linear capacitors (floating
and grounded), diodes and MOS transistors. The IIE method is used for
floating capacitors [25] and the first-order Shichman-Hodges model [26]
equations are used to model MOS transistors. The model equations for each
device are given in APPENDIX III.

Each electrical element has a corresponding program subroutine. The
subroutine evaluates the linear equivalent model for each nonlinear device
and returns it to the calling routine. As mentioned previously, the intercept
current calculation can be avoided by a simple reformulation of the
equations. Using this approach, the conductance and the total current at a
given operating point are returned by each subroutine. The calculation of the
equivalent model assumes that all other nodes have ideal constant voltage
sources attached to them, except in the case of floating capacitors, since IIE
is used.

2.8. ITA Simulation Results

This chapter has been concerned mainly with simulation accuracy and
would not be complete without a comparison of ITA with SPICE2. Fig. 2.6
shows the simulation results obtained for the ring oscillator, 2-input NAND
and boot-strapped inverter circuits described earlier. As indicated by the
results, SPLICE1.7 produces results which are indistinguishable from those
obtained by SPICE2 except at timepoints near time zero, due to different
initial value assumptions. Therefore, circuits which were handled
inadequately using NTA do not pose a problem to ITA in terms of accuracy.

The run-times of the 3 examples do not demonstrate the speed advantage
of ITA because the circuits are all very small with dense G matrices and
small circuits tend to be very active. The selective trace feature in SPLICE is
a significant advantage in very large circuits.

Fig. 2.6: A comparison of the accuracy of ITA vs. SPICE2. NTA had problems
with each circuit. (a) Ring oscillator output. The circuit is shown in
Fig. 2.2(a).

2.9. Optimizations in the Present Implementation


While the data structures used in SPLICE1 are well-suited to handle
circuit, timing and logic simulation concurrently, they are not ideal for
ITA. If a separate program were written to perform ITA, several
optimizations could be made to improve the program performance.


For example, some nodes may be reprocessed after they have converged
because there may be several paths to the same node through different
elements. A node may also be processed many times in succession
before another node is processed (i.e., two or more Newton iterations).
Furthermore, there are many levels of indirection which must be traversed
in order to reach a node, as shown in a previous section.


These problems can be eliminated by scheduling and processing nodes
as opposed to fanout lists. One such scheme, which uses two buffers, EA and
EB, avoids reprocessing a node before all other active nodes are processed:

    put all nodes in event list EA(0) ;
    tn ← 0 ;
    while ( tn < TSTOP ) {
        k ← 0 ;
        while ( event list EA(tn) is not empty ) {
            EB(tn) ← EA(tn) ;  EA(tn) ← empty ;
            foreach ( i in EB(tn) ) {
                obtain ΔV ;
                update the voltage of node i using ΔV ;
                if ( |ΔV| < tolerance ) {    # i.e., if convergence is achieved
                    add node i to list EA(tn+1) ;
                }
                else {
                    add node i to event list EA(tn) ;
                    add fanout nodes of node i to event list EA(tn)
                        if they are not already there ;
                }
            }
            EB(tn) ← empty ;
            k ← k+1 ;
        }
        tn ← tn+1 ;    # tn+1 = next timepoint
    }
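
A minimal Python sketch of a two-buffer scheme of this kind, restricted to a single timepoint of a linear resistive network, is given below. The function name, the dictionary-based circuit description and the Newton-type update ΔV = Inet/Gnet are assumptions made for illustration; one buffer holds the nodes processed in the current sweep and the other collects the nodes for the next sweep, so no node is reprocessed before every other active node has been visited.

```python
def relax_timepoint(volts, fanin, fanout, tol=1e-9, max_sweeps=10_000):
    """Relax all nodes at one timepoint using two event buffers.

    volts:  dict node -> voltage (fixed sources included, never in fanin).
    fanin:  dict node -> list of (conductance, neighbour) pairs.
    fanout: dict node -> list of nodes whose equations depend on it.
    Returns the number of sweeps needed.
    """
    current = list(fanin)        # buffer processed in this sweep
    k = 0
    while current and k < max_sweeps:
        following = []           # buffer filled for the next sweep
        queued = set()
        for i in current:
            g_net = sum(g for g, _ in fanin[i])
            i_net = sum(g * (volts[j] - volts[i]) for g, j in fanin[i])
            delta_v = i_net / g_net
            volts[i] += delta_v
            if abs(delta_v) >= tol:          # not converged yet
                for node in [i] + fanout[i]:
                    if node not in queued:   # add only if not already there
                        queued.add(node)
                        following.append(node)
        current = following      # swap buffers
        k += 1
    return k
```

In this sketch a converged node simply drops out of the buffers; in a full simulator it would instead be carried forward to the event list of the next timepoint.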


Another shortcoming of the current implementation is that if a node
does not converge at a timepoint, the program simply stops execution. The
user must decrease the timestep manually and re-run the entire simulation.
An automatic internal timestep control mechanism would be useful not only
for the convergence problem but also for error control. If the error is small
at a particular timepoint, then the timestep could be increased. If the error
is too large, the timestep could be decreased. The nodes would then be
re-evaluated at the new timepoint. Hence, the timestep could be computed
based on an estimate of the Local Truncation Error. In fact, each node could
have its own mrt, independent of other nodes, as long as some consistency is
maintained in the simulation between different nodes. Unfortunately,
dynamic timestep control requires the ability to "backup" in time (i.e., a
buffering of previous results for each node) and requires a modification of
the data structures to allow successive refinement of the mrt (minimum
resolvable time) in the time queue [27]. For this reason, it would require a
considerable amount of effort to test this scheme in the current SPLICE1
environment.
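
One possible form of such a control policy is sketched below in Python. The halving and doubling factors, the error thresholds and the function name are all assumptions chosen for illustration; the report proposes the mechanism but does not specify a policy.

```python
def next_timestep(h_ticks, error, err_max, h_min=1, h_max=1024):
    """Return (new step in mrt ticks, accept) from a local error estimate.

    Assumed policy: reject the timepoint and halve the step when the
    error estimate exceeds err_max; double the step when the error is
    below err_max / 10; otherwise keep the step unchanged.
    """
    if error > err_max:
        return max(h_min, h_ticks // 2), False   # back up and retry
    if error < err_max / 10:
        return min(h_max, h_ticks * 2), True
    return h_ticks, True
```

A rejected step is the case that needs the "backup" capability described above, since the node values of the previous timepoint must still be available when the step is retried. A caller would also have to stop retrying once the step has reached h_min.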


CHAPTER  3 


3. Enhancements to the Logic Analysis

3.1.  Introduction 

The improvements in the logic analysis of SPLICE1 are described in this
chapter. The starting point for this work was SPLICE1.3. It had the following
features:

• a 4-state logic model (0, 1, X, Z)

• fixed assignable rise and fall delays on all gates

• unidirectional and some bidirectional elements handled.

There have been a number of changes in the logic analysis since the
SPLICE1.3 release. These changes were made to alleviate some of the
problems in the previous version and to facilitate conversions in the
mixed-mode environment.

The new version is SPLICE1.7, which features:

•  a  new  MOS-oriented  state  model 

•  a  fanout  dependent  delay  model 

•  unidirectional  and  generalized  bidirectional  element  processing. 

The logic analysis is performed using a relaxation-based method, similar
in nature to the electrical analysis. In fact, the logic analysis can be thought
of as an implementation of non-iterated timing analysis (see Chap. 2) with
simplified element models. Each logic node carries information about the
node voltage and the equivalent conductance-to-ground, as does the
electrical node. Therefore, the mixed-mode interface is defined in a
consistent manner in SPLICE1.7.


This chapter begins with a description and definition of the new state
model. Following this, the delay model is described. Next, the "spike"
detection and handling procedure is presented. A spike is a pulse at a node
of shorter duration than the minimum width necessary to trigger subsequent
gates. This is usually an error condition which must be identified and
reported to the user. In the next section, the important issues pertaining to
the MOS transfer gate are reviewed. The transfer gate (or transmission gate,
or pass transistor) is the source of many MOS modeling problems at the logic
level and the reason for this will become clear in this chapter. The logic
analysis algorithm is then presented in the section which follows.
SPLICE1.7 can also be used to perform switch-level simulation [9, 10] and this
is described in the last section. Background material on MOS logic
simulation may be found in reference [3].

3.2.  The  State  Model 

3.2.1. A MOS-oriented Logic Model

Fig. 3.1: The circuit in (a) illustrates the use of the strength-oriented
MOS model. The graph in (b) shows the relationship between the strengths
and levels in a 9-state logic model.

Most modern logic simulators handle the problems specific to MOS
integrated circuits by including the notion of signal strength [7, 8, 9, 10]
in the logic model. The rationale for this has been presented in a previous
publication [3]. Strength is an abstraction of the large-signal conductance
from a node to ground or from a node to a supply voltage. It can be
associated with the output of a gate or it can be an attribute of a node. For
example, in the inverter of Fig. 3.1 (M1 and M2), the driver transistor with
its gate input at 5.0V represents a very low resistance path from Node B to
ground. In MOS logic model terms, this is referred to as a "forcing 0" or
"driving 0". Similarly, the load transistor represents a sizeable resistance
from Node B to VDD (approx. 20kΩ to 40kΩ) and this is referred to
alternatively as a "soft 1", a "resistive 1" or a "weak 1". If transistor M3 is
turned "OFF" (that is, if the gate voltage is zero for an NMOS transistor),
Node C goes into a "high-impedance" condition which represents a third
distinct strength. The relationship between strengths and levels is shown in
Fig. 3.1(b). Although most simulators are based on these three strengths,
SPLICE1.7 allows up to 2^16 - 1 strengths for two reasons:

• there is a requirement for more than three strengths when modeling the
  interaction of several transfer gates with differing W/L ratios, typically
  found in bus contention situations.

• it provides a mechanism for consistent signal representation in the logic
  domain for schematic or mixed-mode simulation [22]. If information
  about the effective conductance to ground is stored with each electrical
  node, this information could be converted to a strength value and
  passed on to the logic node, along with the voltage information,
  whenever there is a requirement to do so. Conversions in the opposite
  direction can be performed in a similar manner. In this way, simulation
  accuracy can be maintained in the mixed-mode environment.

3.2.2.  State  Model  Definition 

The state model used in SPLICE1.7 is now formally defined:

• A state is composed of a logic level, logic strength pair (L, S),

    i.e., state = (L, S) = (Level, Strength)

• The logic level can be one of three values: logic zero (0), logic one (1) or
  logic unknown (X). The "0" level represents the low threshold value or
  ground. The "1" level represents the high threshold value or VDD. The
  "X" level represents an undetermined value which could be "0", "1" or
  some value in between. The logic level field is extracted from the state
  using the "lev" function. That is,

    L = lev(state)

• The logic strength is an integer value between 1 and some user-specified
  upper limit. The upper limit has a maximum allowed value of 65,536. In
  this report, the subscripts F, W and H will be used to denote the largest,
  middle and smallest strengths respectively in a given range. The
  strength field is extracted from the state using the "str" function. That
  is,

    S = str(state)

• An initial unknown, Xi, must be distinguished from an unknown
  generated during the analysis, Xg. This is done in SPLICE1.7 by defining
  the initial unknown as follows:

    X = lev(initial_unknown)
    0 = str(initial_unknown)

  and the generated unknown as follows:

    X = lev(generated_unknown)
    0 ≠ str(generated_unknown)

  The initial unknown is useful to identify nodes which are not exercised
  by the input pattern used in a simulation. As a post-processing step,
  these nodes could be reported to the user.
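
The definition above can be made concrete with a short Python sketch. The tuple representation, the helper names and the default upper limit of 3 are assumptions for illustration only; the "str" function is named `strength_of` here to avoid shadowing the Python built-in, and nothing is implied about how SPLICE1.7 packs the two fields in memory.

```python
from typing import NamedTuple

LEVELS = ("0", "1", "X")
MAX_STRENGTH = 65_536     # maximum allowed upper limit quoted in the text


class State(NamedTuple):
    level: str       # one of LEVELS
    strength: int    # 0 is reserved for the initial unknown


def lev(state):
    return state.level


def strength_of(state):      # the report calls this function "str"
    return state.strength


def make_state(level, strength, upper_limit=3):
    """Build a state whose strength lies between 1 and upper_limit."""
    if level not in LEVELS:
        raise ValueError("unknown logic level")
    if not 1 <= strength <= upper_limit <= MAX_STRENGTH:
        raise ValueError("strength out of range")
    return State(level, strength)


INITIAL_UNKNOWN = State("X", 0)


def is_initial_unknown(state):
    return lev(state) == "X" and strength_of(state) == 0
```

Because ordinary states always carry a strength of at least 1, a post-processing pass can find unexercised nodes simply by testing `is_initial_unknown` on the final state of every node.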

3.2.3.  Using  the  State  Model 

In  a  logic  analysis,  nodes  are  scheduled  to  be  processed  in  the  time 
queue  in  accordance  with  the  activity  in  the  circuit.  When  a  node  is 


include all the first-order effects in the delay calculation. SPLICE1.7 is
capable of modeling the effects due to the first four factors. The fifth factor
(input waveform shape) is more difficult to handle at the logic level although
it may be a significant factor in many cases.

3.3.2. Delay Model for Simple Gates

The usual modeling procedure for logic simulation is to generate a set of
curves similar to Fig. 3.2 for every primitive element (NANDs, NORs,
inverters, etc.) using accurate electrical simulation. In this figure, the delay
from the input switching point to the output switching point is plotted as a
function of output loading and the number of inputs. A step voltage is
assumed as the input of the gate. Although not strictly true, the relationships
are usually taken to be linear. The y-intercept of each curve represents the
intrinsic unloaded gate delay while the slope of each curve represents the
gate drive-capability.

Assuming that the above information is available, the following method
can be used to calculate delays for simple gates. The first requirement is
that a capacitance value be specified on every input and output pin of every
gate as part of the model definition. Then the total gate delay can be
represented by four parameters: the intrinsic gate delays (tr, tf) and the
gate drive-capabilities (trc, tfc),

where

    tr  = rise time for unloaded gate (intercept)
    tf  = fall time for unloaded gate (intercept)
    trc = gate drive-capability for rising signals (slope)
    tfc = gate drive-capability for falling signals (slope)


Fig. 3.2: Typical delay curves generated for a logic gate
using electrical analysis

Using these values, the total delay is calculated using the equations:

    risetime = tr + trc*(node capacitance)    (3-1a)

    falltime = tf + tfc*(node capacitance)    (3-1b)

The node capacitance value is extracted in a pre-processing step by
summing the capacitances of all elements connected to a node, and stored
with the node data structure (see APPENDIX II, part 1). This process is
illustrated in Fig. 3.3.
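
Eq. (3-1) and the capacitance pre-processing step can be expressed as a small Python sketch. The class and function names are hypothetical and the units are left to the caller; only the linear relationship itself is taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDelay:
    tr: float    # intrinsic rise delay of the unloaded gate (intercept)
    tf: float    # intrinsic fall delay of the unloaded gate (intercept)
    trc: float   # rise drive-capability (slope, delay per unit capacitance)
    tfc: float   # fall drive-capability (slope, delay per unit capacitance)


def node_capacitance(pin_caps):
    """Sum the pin capacitances of all elements connected to a node."""
    return sum(pin_caps)


def risetime(d, cap):
    return d.tr + d.trc * cap      # Eq. (3-1a)


def falltime(d, cap):
    return d.tf + d.tfc * cap      # Eq. (3-1b)
```

Since the node capacitance does not change during a simulation, it only needs to be summed once per node before the analysis starts.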

3.3.3.  Delay  Model  for  Multi-output  Elements 

If a multi-output element, such as the flip-flop shown in Fig. 3.4(a), is
available as a primitive logic element, each output would have its own set of
curves similar to Fig. 3.2. The curves for the Q and QB outputs of the flip-flop
are shown in Fig. 3.4(b). Then 4N parameters would be required to specify
the delay, where N is the number of outputs. For the flip-flop there would be
8 such parameters: Qtr, Qtf, QBtr, QBtf, Qtrc, Qtfc, QBtrc, and QBtfc. These
parameters would be applied to Eqn. (3-1) to calculate the delay. Using this
technique, the delay associated with each output could be handled
independently. The overriding assumption is that the rise and fall
drive-capabilities (trc, tfc) of the outputs are constant and independent of
the inputs.

In certain elements, the delay from a particular input (say, the RESET
pin of the flip-flop) to a given output (either Q or QB) is different from
another input to output delay (J- or K-input to Q delay). This suggests that,
in fact, the intrinsic delay should be a matrix which is indexed by the input
pin which initiates activity and the output pin being processed. Then the
total delay due to loading could be calculated using Eqn. (3-1) and the
specific trc and tfc values for each output. This is shown in Table 3.1 below
for the flip-flop example.


Table 3.1 Intrinsic Delay Matrix

I/O pin   J             K             CLK           RESET        SET
Q         tr=10, tf=10  tr=10, tf=10  tr=10, tf=10  tr=5, tf=6   tr=5, tf=6
QB        tr=10, tf=11  tr=10, tf=11  tr=10, tf=11  tr=5, tf=6   tr=5, tf=8
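
A lookup of this kind can be sketched in Python as follows. The intrinsic delays are those of Table 3.1; the drive-capability values in `DRIVE` and the function name are hypothetical and serve only to show how the matrix would be combined with Eq. (3-1).

```python
# Intrinsic delays (tr, tf) from Table 3.1, indexed by (output pin, input pin).
INTRINSIC = {
    ("Q", "J"): (10, 10), ("Q", "K"): (10, 10), ("Q", "CLK"): (10, 10),
    ("Q", "RESET"): (5, 6), ("Q", "SET"): (5, 6),
    ("QB", "J"): (10, 11), ("QB", "K"): (10, 11), ("QB", "CLK"): (10, 11),
    ("QB", "RESET"): (5, 6), ("QB", "SET"): (5, 8),
}

# Per-output drive-capabilities (trc, tfc); hypothetical values.
DRIVE = {"Q": (2.0, 3.0), "QB": (2.0, 3.0)}


def output_delay(output, trigger_input, rising, cap):
    """Delay at an output for the input pin that initiated the activity."""
    tr, tf = INTRINSIC[(output, trigger_input)]
    trc, tfc = DRIVE[output]
    return tr + trc * cap if rising else tf + tfc * cap
```

The intercept depends on both pins, while the slope depends only on the output, which matches the assumption that drive-capability is independent of the inputs.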

3.3.4.  Delay  Models  for  Transfer  Gates 

The delay calculation for logic circuits containing transfer gates is more
complex than either of the two cases given above. Consider the circuit of
Fig. 3.5. The delay from Node A to Node B when the input CLK of the transfer
gate makes a transition from "0" to "1" is based on:

•  the  W/L  ratio  of  the  transfer  gate 

•  the  drive-capability  of  gate  INV 

•  charge-sharing  between  Cl  and  C2 

It is a highly nonlinear situation and therefore difficult to model at the
logic level. Charge-sharing cannot be represented properly because of the
voltage resolution in the SPLICE1 state model. One method to model this
effect is to allow multiple voltage levels in the same way that the impedance
levels have been extended. This would facilitate the characterization of
charge-sharing but would make the simulator more complicated. The
simulator would have to perform transitions from one voltage level to another
in a consistent manner. SPLICE1.7 lumps all the nonlinear effects into two
values called the turn-on (ton) and turn-off (toff) times. These values do not
take capacitive effects into account.


Another  delay  modeling  issue  concerns  transfer  gates  connected  in 
series  as  shown  in  Fig.  3.6.  The  delay  in  question  is  that  from  Node  A  to  Node 
E.  If  all  gates  are  "ON",  the  circuit  can  be  represented  by  an  RC  transmission 
line.  Unfortunately,  this  is  also  difficult  to  model  at  the  logic  level.  A  few 
alternatives  exist  to  deal  with  this  situation: 

• Use a zero delay model through transfer gates when they are "ON" [8].
  This is the method used in SPLICE1.7. Unfortunately, the value of delay
  calculated this way is overly optimistic at Node E.

• Lump capacitances C1, C2, C3, C4 and C5 together and use this value in
  Eq. (3-1). This is the transition delay for all nodes from the old state to
  the new one. The value of delay calculated this way is overly pessimistic
  at Node A.

• Extend the notion of drive-capability of a gate to nodes other than its
  output node. Since tx1 is "ON", both Node A and Node B are driven by
  gate INV. Therefore, the delay to A could be calculated as given in Eq.
  (3-1) and the delay to B could be calculated using the equations:

    risetime = trc_INV*(capacitance at B)    (3-2a)

    falltime = tfc_INV*(capacitance at B)    (3-2b)

  To compute the delay to nodes C, D and E, simply apply Eq. (3-2) again
  using the capacitance at node C, D and E respectively. This approach is
  better than either of the above methods but is still lacking in accuracy
  because it does not account for the "ON" resistance of the transistors.
  One modification which may provide more accuracy is to adjust the
  values of trc and tfc using the "ON" resistance of the transfer gates and
  the depth of the node away from the output of the controlling node. This
  method is promising because VLSI circuits typically contain
  interconnections which are electrically equivalent to distributed RC
  transmission lines. This interconnect delay dominates the total delay
  for very large circuits. It could be modeled the same way as the set of
  series transfer gates. Therefore, a netlist extractor could provide the
  simulator with "DELAY" elements, as shown in Fig. 3.7, in place of
  interconnect, with delay calculations performed using the modified
  Eq. (3-2):

    risetime = TRC*(capacitance at node)    (3-3a)

    falltime = TFC*(capacitance at node)    (3-3b)

  where

    TRC = trc*f(resistance, depth)

    TFC = tfc*f(resistance, depth)

Fig. 3.5: The delay from Node A to Node B, when the transfer gate turns on,
is due to highly nonlinear effects which are difficult to model at the
logic level.

Fig. 3.7: A proposed equivalent model for a delay element
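
The three alternatives for a chain of "ON" transfer gates can be compared with a short Python sketch. The function name, the omission of the intrinsic gate delay and, in particular, the scaling function `f` are assumptions: the report does not define f(resistance, depth), so the default used here leaves the slope unchanged and any other choice is purely illustrative. The sketch also treats each application of Eq. (3-2) as an independent per-node delay, which is one reading of the text.

```python
def chain_delays(caps, slope, method, r_on=0.0, f=None):
    """Delay assigned to each node of a chain of "ON" transfer gates.

    caps:   capacitances at the chain nodes, driver output first.
    slope:  drive-capability (trc or tfc) of the driving gate.
    method: "zero", "lumped" or "per_node".
    f:      hypothetical scaling function f(r_on, depth) for Eq. (3-3).
    The intrinsic gate delay is left out to keep the comparison simple.
    """
    if f is None:
        f = lambda resistance, depth: 1.0    # reduces Eq. (3-3) to Eq. (3-2)
    if method == "zero":       # zero delay through ON gates
        first = slope * caps[0]
        return [first] * len(caps)
    if method == "lumped":     # all capacitances lumped together
        total = slope * sum(caps)
        return [total] * len(caps)
    if method == "per_node":   # Eq. (3-2) or (3-3) applied node by node
        return [slope * f(r_on, depth) * c for depth, c in enumerate(caps)]
    raise ValueError(method)
```

For the same chain, the "zero" result is the smallest at the far end and the "lumped" result is the largest at the driver output, which reflects the optimistic and pessimistic behaviour noted above.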


3.3.5.  Delay  to  an  Unknown  Value 

The delay calculations in the previous section assume signal transitions
from "0" to "1" or "1" to "0". Nodes may, of course, acquire the X level due to
contention at the node as described earlier. The question then arises as to
when the X level takes effect. The unknown level could be "0", "1" or some
intermediate value. Clearly, if the unknown is the previous value, there is no
delay. If it is a new logic level there is a rise or fall transition delay. The
usual approach is to assume that the unknown value takes effect
immediately (as is done in SPLICE1.7) or one time unit in the future.

3.4.  Spike  Handling 

SPLICE1.7 uses an inertial delay algorithm. This means that if a node is
scheduled to change at some time in the future Tn, it is held at its old value
until that time. Then at Tn, the new value is assigned to the node and the
fanouts of the node are processed using this value. A spike (commonly
referred to as a glitch) occurs if the node is scheduled to change to a
different value before it reaches the new value. Spike detection is simple in
true-value logic simulation but becomes very complicated when performing
fault simulation. When a spike is detected, the event at Tn is dropped, the
new event is scheduled at the appropriate time and the user is notified of the
glitch. The glitch is not propagated because it usually signifies an error in
the circuit design. Therefore, the simulation will continue as if an error did
not occur and more meaningful information may be obtained about the
correct operation of the circuit. This technique also reduces the amount of
work the simulator is required to do since spikes represent activity in the
circuit. Therefore, the overall CPU-time will be kept to a minimum by
removing glitches from the simulation.


In SPLICE1.7, a fanout list (FOL) can only appear once on the time queue
at any given time during the processing. This is a limitation for proper glitch
handling, as will be seen in the pseudo-code description of glitch handling
which follows. Two different problems are identified which are direct results
of the scheduling limitation.

    # GLITCH HANDLER IN SPLICE1.7
    # PT    = present time
    # Tnext = next time FOL is scheduled
    # Tlast = last time FOL was scheduled to be processed
    #         (or was actually processed)

    if (Tlast < PT) {    # node was processed in the past
        store new_state ;
        schedule FOL at Tnext ;
    }
    else if (Tlast = PT) {    # node is scheduled now
        if (Tnext ≥ Tlast) {
            if (FOL processed) {    # PROBLEM : glitch has been propagated
                update new_state ;
                schedule FOL at Tnext ;
            }
            else {    # FOL has not been processed
                # PROBLEM : cannot schedule FOL more than once
                drop schedule at Tlast ;
                replace new_state ;
                schedule FOL at Tnext ;
            }
        }
    }
    else if (Tlast > PT) {    # node is scheduled in the future
        if (Tnext < Tlast) {
            # reschedule time is earlier than originally scheduled time
            report glitch ;
            drop schedule at Tlast ;
            replace new_state ;
            schedule FOL at Tnext ;
        }
        else if (Tnext = Tlast) {
            report glitch ;
            replace new_state ;
        }
        else if (Tnext > Tlast) {    # want to sched in future
            report glitch ;
            drop schedule at Tlast ;
            store new_state ;
            schedule FOL at Tnext ;
        }
    }
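
The core of the inertial-delay rule can be sketched in Python for a scheduler that keeps a single pending event per node, which mirrors the one-schedule-per-FOL limitation. The class and function names are hypothetical, and the sketch covers only the case of a node with an event pending in the future; it does not reproduce the "scheduled now" branches of the handler above.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.pending = None       # (time, new_value) or None


def schedule(node, time, new_value, glitches):
    """Schedule a change; a different pending value is reported as a glitch."""
    if node.pending is not None and node.pending[1] != new_value:
        glitches.append((node.pending[0], time))   # report the spike
    node.pending = (time, new_value)               # old event is dropped


def advance(node, now):
    """Assign the pending value once its scheduled time has been reached."""
    if node.pending is not None and node.pending[0] <= now:
        node.value = node.pending[1]
        node.pending = None
```

Because the dropped event never reaches `advance`, the spike is reported but not propagated to the fanouts of the node.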


The problems identified above can be summarized as follows: depending
on the order in which nodes are processed at a timepoint, the program may
or may not propagate one particular type of glitch, which will be referred to
as the Edge glitch or "E" glitch. Therefore, the output of the simulation
depends on the order in which the circuit was specified by the user. The "E"
glitch is always identified but its propagation is based on node processing
order. One way to get around this problem is to use a two-pass approach by
first performing a leveling operation [28] as a preprocessing step. This
simply means that each node should be assigned a value based on its depth
from the inputs. Then every node scheduled at a given timepoint should be
processed in ascending order. This would incur some overhead but would
produce the desired results, i.e., the same solution regardless of the order of
the input description. At the present time, SPLICE1.7 will identify the glitch
and may or may not propagate the glitch depending on the order the nodes
are processed.
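
A leveling pass of the kind described can be sketched in Python as follows. The function names and the dictionary-based netlist are assumptions, and the sketch assumes an acyclic fanout graph; circuits with feedback would need their loops broken before levels can be assigned.

```python
from collections import deque


def levelize(fanout, primary_inputs):
    """Assign each node its depth (longest path) from the primary inputs.

    fanout: dict node -> list of nodes driven by it (assumed acyclic).
    """
    indegree = {n: 0 for n in fanout}
    for targets in fanout.values():
        for t in targets:
            indegree[t] = indegree.get(t, 0) + 1
    level = {n: 0 for n in primary_inputs}
    ready = deque(primary_inputs)
    while ready:
        n = ready.popleft()
        for t in fanout.get(n, []):
            level[t] = max(level.get(t, 0), level[n] + 1)
            indegree[t] -= 1
            if indegree[t] == 0:     # all drivers of t have been levelled
                ready.append(t)
    return level


def processing_order(scheduled_nodes, level):
    """Order the nodes scheduled at one timepoint by ascending level."""
    return sorted(scheduled_nodes, key=lambda n: (level[n], n))
```

Sorting by level makes the processing order at a timepoint independent of the order in which the circuit was described.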

Another  way  to  eliminate  the  problem  is  by  modifying  the  scheduler 
data  structure  so  that  multiple  schedules  are  allowed.  Instead  of  scheduling 
FOLs,  it  would  be  better  to  schedule  structures  which  point  to  the  FOL.  This 
structure  would  have  to  include  other  information  such  as  the  schedule  time, 
and  forward  and  backward  pointers  to  the  next  and  previous  schedules  of  the 
same  FOL  in  the  time  queue.  This  would  allow  easy  access  to  all  the 
schedules  of  a  single  FOL  for  adding  and  dropping  subsequent  events.  This 
proposed  data  structure  is  shown  in  Fig.  3.8.  One  advantage  of  multiple 
scheduling  is  that  the  program  can  be  modified  to  perform  parallel  fault 
simulation  using  this  data  structure. 
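
A minimal Python sketch of such a scheduler is shown below. The class names, the use of a binary heap for the time queue and the lazy removal of dropped records are assumptions made for illustration; the structure actually proposed is the one in Fig. 3.8. Records of the same FOL are linked in the order in which they were scheduled.

```python
import heapq
import itertools


class ScheduleRecord:
    """One scheduled activation of a FOL; records of a FOL are linked."""

    def __init__(self, fol, time):
        self.fol = fol
        self.time = time
        self.prev = None       # previous schedule of the same FOL
        self.next = None       # next schedule of the same FOL
        self.dropped = False


class TimeQueue:
    def __init__(self):
        self._heap = []
        self._count = itertools.count()   # tie-breaker for equal times
        self._last = {}                   # FOL -> most recent record

    def schedule(self, fol, time):
        rec = ScheduleRecord(fol, time)
        rec.prev = self._last.get(fol)
        if rec.prev is not None:
            rec.prev.next = rec
        self._last[fol] = rec
        heapq.heappush(self._heap, (time, next(self._count), rec))
        return rec

    def _unlink(self, rec):
        if rec.prev is not None:
            rec.prev.next = rec.next
        if rec.next is not None:
            rec.next.prev = rec.prev
        if self._last.get(rec.fol) is rec:
            self._last[rec.fol] = rec.prev

    def drop(self, rec):
        """Cancel a schedule; it is skipped when it reaches the heap top."""
        rec.dropped = True
        self._unlink(rec)

    def pop(self):
        while self._heap:
            _, _, rec = heapq.heappop(self._heap)
            if not rec.dropped:
                self._unlink(rec)
                return rec
        return None
```

With several records per FOL, the glitch handler could drop one pending event and add another without disturbing the remaining schedules of the same FOL.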
A simple circuit which is useful for debugging glitch handling code is the
clock generator shown in Fig. 3.9. By adjusting tr and tf for each gate, all
possible glitch conditions can be produced. For example, if tr = tf = 10ns, the
"E" glitch can be generated.

3.5. Transfer Gate Modeling Issues


The incorporation of strengths into the state model does not in itself
solve all the problems of MOS logic simulation. As described in the previous
section, delay modeling is still difficult and the notion of strengths does not
provide any leverage in solving the problem. Transfer gates complicate the
situation even more because they introduce dynamic loading effects,
bidirectional signal flow, node decay, and charge-sharing. In the sections to
follow, these and other problems concerning the transfer gate are described
and the solutions used in SPLICE1.7 are presented.

3.5.1.  Bidirectional  Transfer  Gates 

In general, the transfer gate is a bidirectional element but it is usually
found in a unidirectional application. That is, the designer intended signals
to flow in one direction through the device. SPLICE1.7 provides
unidirectional transfer gates (UTXG) for this purpose, as it simplifies the
processing thereby reducing CPU-time.

Fig. 3.9: A simple circuit which can be used to generate the various
glitch conditions by adjusting the values of tr and tf.

Fig. 3.10: A bidirectional transfer gate model which employs
unidirectional transfer gates. The usual problem associated with this
approach is that contentions may not be resolved.

On the other hand, there are occasions when transfer gates are used in
bidirectional applications and therefore the logic simulator must be able to
analyze them accurately. There have been a variety of modeling approaches
for bidirectional transfer gates (BTXG), including the conventional approach
of two unidirectional elements back-to-back as shown in Fig. 3.10. This
[...] opposite sides of the element, as is the case in Fig. 3.10. Each value can
flow through the BTXG and reach the opposite side and these errors can
percolate further through the circuit producing incorrect results. One simple
way to process BTXG's in a consistent way is to introduce the concept of
composite node relaxation (CNR). In this method, all nodes connected through
transfer gates which are "ON" are considered to be the same node for
processing purposes. All fanin lists for the composite node are combined into
one list and a new state is determined based on the composite fanin list.
Since all nodes connected by "ON" BTXG's are considered the same node,
there is no delay between them.
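
The grouping step of composite node relaxation can be sketched in Python with a union-find structure. The function names and the netlist representation are assumptions for illustration; how the new state is resolved from the combined fanin list is not shown.

```python
def composite_nodes(nodes, transfer_gates, gate_is_on):
    """Group nodes joined by "ON" bidirectional transfer gates.

    transfer_gates: list of (name, node_a, node_b).
    gate_is_on:     dict name -> bool.
    Returns a dict node -> representative of its composite node.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    for name, a, b in transfer_gates:
        if gate_is_on[name]:
            parent[find(a)] = find(b)       # merge the two groups
    return {n: find(n) for n in nodes}


def composite_fanin(fanin, groups):
    """Merge the fanin lists of all nodes in each composite node."""
    merged = {}
    for node, elements in fanin.items():
        merged.setdefault(groups[node], []).extend(elements)
    return merged
```

The grouping has to be recomputed whenever a transfer gate turns "ON" or "OFF", since the membership of each composite node changes with the gate states.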

3.5.2.  Unknowns  at  Gate  Inputs 

Another problem in modeling transfer gates is due to unknowns at gate
inputs. The problem is identified in Fig. 3.11. Normally, if the transfer gate
is "ON" and then shuts "OFF", the output Node A retains its previous value
but is reduced in strength (goes to the H strength). This is shown in Fig.
3.11(a). There are three cases to consider in conjunction with unknowns at
transfer gates.

CASE 1 : Fig. 3.11(b) indicates the situation at the beginning of the
simulation. Virtually all gate inputs, except for the ones that have been
initialized explicitly, are in the initial unknown condition. In this situation,
the gate may or may not pass signals.

CASE 2 : The second situation occurs when there is a logic "1" at the input
and it changes to a logic "X". In this case, the level at the output remains the
same but the strength is not known. This is illustrated in Fig. 3.11(c).

CASE 3 : The third situation is the reverse of the second. Here, the input
goes from logic "0" to logic "X". Both the value and strength may change.
This is shown in Fig. 3.11(d).

There  are  a  few  alternative  methods  to  handle  unknowns  at  gate  inputs. 

(1)  A pessimistic approach is to generate an X at the output so that it will be
     propagated further. This may produce incorrect circuit operation if
     CASE 2 is considered, but is the easiest to implement.

(2)  Another approach is to have the notion of unknown strengths. Using this
     model, CASE 2 could be handled by setting the output node to its
     previous value with an unknown strength. This introduces some complications
     in the way the simulator processes nodes. Some bit pattern would have
     to be selected for unknown strengths. It is not clear how this special
     strength value would interact with other strengths.

Currently, method (1) is used in SPLICE1.7 and other methods are under
investigation [29].

3.5.3.  Node  Decay 

When  a  node  acquires  the  H  strength,  it  retains  the  previous  state  on 
the  capacitance  at  the  node.  In  physical  terms,  charge  is  trapped  at  the 
node but there are parasitic resistive paths from the node to ground or VDD.
Therefore,  the  node  will  eventually  lose  its  value  and  it  will  become  unknown. 
This  is  referred  to  as  node  decay.  The  time  constant  for  the  decay  is  large 
but  finite. 

It  is  useful  to  include  node  decay  as  part  of  the  simulation,  especially  for 
dynamic  circuits.  One  way  to  do  this  is  to  detect  the  H  strength  at  a  node 
and  schedule  the  node  to  decay  after  a  specified  amount  of  time  by  setting  a 
special flag at the node. If the node is not redriven before this time, the
node  is  placed  in  the  unknown  state.  If  the  node  is  redriven,  the  scheduled 
event  would  be  dropped  and  processing  would  continue  in  the  normal  way. 
Unfortunately,  the  program  would  incur  an  excessive  amount  of  scheduling 
and  de-scheduling  overhead,  especially  in  the  case  of  dynamic  MOS  circuits. 
Moreover,  in  SPLICE1.7,  most  of  the  scheduled  nodes  would  be  put  into  the 
pool  (see  Appendix  II,  part  5).  The  pool  is  an  overflow  area  designed  to  store 
all  schedules  which  are  greater  than  200  timepoints  in  the  future.  If  node 
decay  is  processed  as  suggested  above,  the  scheduled  decay  events  would  all 
be  placed  in  the  pool  and  eventually  overflow  the  limit  of  the  pool  area. 
Clearly  this  is  not  a  suitable  approach. 

An alternative approach, proposed by Boyle [30], is to simply store the
decay time along with the node data structure and avoid scheduling
altogether. Any time the node is redriven, this value could be compared to the
current time. If the current time is greater than the decay time, a warning
message could be placed in a file, if the user has requested decay errors.
Then processing would continue as if node decay had not occurred. It is not
useful to simulate the circuit under decay conditions because it is usually a
design error. Therefore, it is simply flagged as an error and then ignored for
the remainder of the simulation.
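
A minimal Python sketch of the stored-decay-time idea is given below. It is not taken from SPLICE1.7 or from [30]; the `Node` class, the `drive` function, the strength label "F" and the `decay_interval` parameter are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

HIGH_IMPEDANCE = "H"  # the charge-storing strength referred to in the text


@dataclass
class Node:
    name: str
    level: str = "X"
    strength: str = HIGH_IMPEDANCE
    decay_time: Optional[float] = None  # None: no decay deadline pending


def drive(node, level, strength, now, decay_interval, warnings):
    """Apply a new state to `node` at time `now` without using the scheduler.

    If the node was left floating beyond its stored decay time, a warning is
    recorded and processing continues as if no decay had occurred.
    """
    if node.decay_time is not None and now > node.decay_time:
        warnings.append(
            f"{node.name}: redriven at {now} after decay time {node.decay_time}"
        )
    node.level, node.strength = level, strength
    # Entering the H strength starts a new deadline; other strengths clear it.
    node.decay_time = now + decay_interval if strength == HIGH_IMPEDANCE else None
```

The only cost per node is one stored number and one comparison when the node is redriven, so no decay events ever enter the time queue or the pool.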

3.6.  Logic Simulation Implementation Details

3.6.1.  General Program Flow

The following is a high-level pseudo-code description of the general
program flow during a logic analysis. Note the parallel between the ITA program
flow described in the previous chapter and the code below:


set all nodes to their initial values ;
schedule all FOL's at time 0 ;                 # FOL = fanout list

while ( tn <= TSTOP ) {
    for (each FOL in the queue at the current timepoint) {
        for (each element in the FOL) {
            for (each output node of an element) {
                process node ;                 # see next section for details
                schedule FOL if necessary ;
            }
        }
    }
    plot all active nodes ;
    tn <- tn+1 ;
}


3.6.2.  Node  Processing  Details 


#
# LOGIC NODE PROCESSING DETAILS
#

begin
    current_state <- (X,0) ;
    place node in CNL ;                        # CNL = composite node list
    for (each node in the CNL) {
        for (each element in the FIL) {        # FIL = fanin list
            if (element = BTXG) & (gate = "ON") {
                place node in CNL ;
            }
            else {
                determine output_state (L,S) of element ;
                intended_state <- output_state ;

                if ( str(intended_state) > str(current_state) )
                    current_state <- intended_state ;
                else if ( str(intended_state) = str(current_state) )
                    if ( lev(intended_state) != lev(current_state) )
                        lev(current_state) <- X ;
            }
        }
    }
    new_state <- current_state ;
    if (new_state != old_state) {
        for (each node in CNL) {
            calculate delay(old_state,new_state) ;
            call GLITCH HANDLER to schedule FOL ;   # FOL = fanout list
        }
    }
end
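
The strength and level comparison at the centre of the node processing pseudo-code can be written as a short Python function. This is a sketch under assumptions: strengths are encoded here as integers where a larger value is stronger, levels are the strings "0", "1" and "X", and the name `resolve` is hypothetical.

```python
X = "X"  # unknown logic level


def resolve(output_states):
    """Combine the (level, strength) pairs produced by the fanin elements.

    Starts from (X, 0), as in the pseudo-code. A strictly stronger driver
    replaces the current state; an equally strong driver with a different
    level forces the level to X.
    """
    level, strength = X, 0
    for lev, stren in output_states:
        if stren > strength:
            level, strength = lev, stren
        elif stren == strength and lev != level:
            level = X
    return level, strength
```

For example, `resolve([("1", 2), ("0", 1)])` keeps the stronger "1", while two equally strong drivers that disagree leave the node at X with their common strength.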


3.7.  Switch-level  Simulation 


The definition of UTXG and BTXG elements (given in 3.5.1) allows
switch-level simulation [9, 10] to be performed using SPLICE1.7. Loads can be
modeled using a UTXG with a "weak" output strength. Drivers can be
modeled using a UTXG with a "forcing" output strength. Either a BTXG or a
UTXG can be used for pass transistors depending on the application. Other
floating transistors must be BTXG's. If "ton" and "toff" are specified as 1 unit
of time, then a unit-delay switch-level simulation will be performed by
SPLICE1.7. For obvious reasons, delays at the switch level cannot be
modeled in the same way as is currently done in SPLICE1.7 for standard
Boolean gates. Two approaches have been proposed to introduce detailed
timing information at the switch level using a resistive simulation model
[31, 32]. These methods are under investigation at the present time.

CHAPTER 4


4.  Examples  and  Results 

In this chapter, a number of simulation results and program performance
statistics of SPLICE1.7 are presented. Five aspects of the program are
examined in the sections to follow. These are:

•  the program performance statistics such as processing speed for the
   electrical and logic analyses, typical storage requirements per element,
   iteration counts, etc.

•  the identification of bottlenecks using profilers

•  the factors which affect the run-times such as mindvsch, sor, mrt and
   floating capacitors

•  SPLICE1/SPICE2 comparisons for execution speed, memory
   requirements and simulation accuracy

•  mixed switch, logic and electrical-level simulations

The  simulations  were  carried  out  on  the  following  circuits: 

(1)  Digital Filter Circuit: This circuit was obtained from [1]. It is the
     control logic for a digital filter circuit. There are 705 MOS transistors and
     393 nodes in the circuit. The simulation period is 4 μs.

(2)  Counter-Decoder-Encoder Circuit: This circuit is a combination of a
     4-bit counter driving a 4:16 decoder and a 16:4 encoder. It is referred to
     as the CDE circuit in the rest of the chapter. The switching
     characteristics were based on the specifications provided in a TTL Handbook [33].
     The circuit has 1,326 MOS transistors and 553 nodes and is the largest
     circuit simulated so far. The simulation period is also 4 μs [34].

(3)  NMOS Operational Amplifier: This circuit was obtained from [35]. It
     was designed as part of a phase-locked loop circuit. This circuit is used
     to illustrate the capability of ITA when simulating analog circuits.

(4)  Boot-strapped Inverter Circuit: This circuit was described earlier in
     Chap. 2. It is illustrated in Fig. 2.4(b). The circuit is used to examine
     the effects of a floating capacitor element in an ITA simulation.

(5)  Industrial Microprocessor Control Circuit: This is the critical path
     through the control circuitry of a μP designed using NMOS technology.

(6)  Industrial 64K RAM Circuit: This is a portion of a high-speed 64K static
     RAM circuit designed using CMOS technology.

(7)  4x5 Multiplier Circuit: This circuit uses a standard multiplier structure.
     It features novel exclusive-OR and ADDER functions designed by
     Kuninobo [38]. This example is used to illustrate mixed-mode simulation
     and to compare the run-times associated with transistor-based
     simulation at the electrical and switch levels.

4.1.  Program Performance Statistics

In order to predict the run-times and memory requirements of the
program SPLICE1.7, the program execution speed and memory usage statistics
are required. These statistics have been tabulated below for both the
electrical and logic simulators.

Electrical Simulation Statistics


Node  Evaluations  400  nodes/sec. 

SOR-Newton  Iterations  (no  floating  caps)  3-5  iterations/node 

SOR-Newton  Iterations  (with  floating  caps)  6-20  iterations/node 


Electrical Element Storage Requirements


Elements        Type          Words Required

Transistors     Load          3 x no. of loads
                Driver        4 x no. of drivers
                Transistor    5 x no. of transistors

Capacitors      Grounded      0
                Floating      3 x no. of capacitors

Resistors                     3 x no. of resistors


Element Model   Type          Words Required

Transistors     Load          13 x no. of loads
                Driver        13 x no. of drivers
                Transistor    14 x no. of transistors

Capacitors      Grounded      1 x no. of grounded capacitors
                Floating      3 x no. of floating capacitors

Resistors                     3 x no. of resistors

Logic Simulation Statistics

Node Evaluations                              650 nodes/sec.

Logic Element Storage Requirements

Elements          Words Required

inverter          3 x no. of inverters
buffer            3 x no. of buffers
AND               ~ 5 x no. of ANDs
OR                ~ 5 x no. of ORs
NAND              ~ 5 x no. of NANDs
NOR               ~ 5 x no. of NORs
XOR               ~ 5 x no. of XORs
XNOR              ~ 5 x no. of XNORs
transfer gates    ~ 4 x no. of devices


Model Type        Words Required

inverter          11 x no. of different inverter models
buffer            11 x no. of different inverter models
AND               ~ 11 x no. of different AND models
OR                ~ 11 x no. of different OR models
NAND              ~ 11 x no. of different NAND models
NOR               ~ 11 x no. of different NOR models
XOR               ~ 11 x no. of different XOR models
XNOR              ~ 11 x no. of different XNOR models
transfer gates    ~ 8 x no. of device models


Node  Storage  Requirements 


N  =  number  of  circuit  nodes 

Data  Logic  Node  Electrical  Node 


Node  list  1  1 

Node  pointers  N  N 

Node  data  8N  9N 

Node  FIL  N  3N 

Node  FOL  3N  3N 

Capacitor  N  N 
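
As a worked example of how the node storage table can be used, the short Python sketch below sums the per-node entries. The function and dictionary names are hypothetical; the word counts are those listed in the table, and the one-word node list is assumed to be shared by the whole circuit.

```python
# Words per node, taken from the Node Storage Requirements table.
PER_NODE_WORDS = {
    "logic": {"pointers": 1, "data": 8, "FIL": 1, "FOL": 3, "capacitor": 1},
    "electrical": {"pointers": 1, "data": 9, "FIL": 3, "FOL": 3, "capacitor": 1},
}
NODE_LIST_WORDS = 1  # single entry, independent of N


def node_storage_words(n_nodes, kind):
    """Estimate node storage in words for `n_nodes` nodes of the given kind."""
    return NODE_LIST_WORDS + n_nodes * sum(PER_NODE_WORDS[kind].values())
```

The totals are 14N + 1 words for logic nodes and 17N + 1 words for electrical nodes. For the 393-node digital filter circuit this gives 5503 words if all nodes are logic nodes and 6682 words if all are electrical nodes, excluding element and model storage.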


4.2.  Profile  Statistics 

The SPLICE1.7 program execution times can be reduced somewhat by
applying the techniques suggested in Chap. 2. These were based on intuitive
arguments. It is important to identify, in a quantitative way, the bottlenecks
in the program and where it is spending most of its time. A profiler is a
modern programming tool which is very useful for this task. It monitors the
program during execution and provides information relating to the
percentage of time spent in each subroutine. Using this information, the program
can be modified in sections where it will provide the most benefit.

The following profile statistics were obtained from a simulation of the
digital filter circuit using electrical analysis.


total time:  3196 seconds

time(%)  time(sec.)  no. of calls  name    subroutine task

28.7     916.33      1222883       prtim   processing a timing node
17.3     554.47      2449814       tntxg   evaluate transistor model
14.2     455.32      5799795       geterv  get a value from another node
 8.2     231.63       951895       sqrt    perform a square root operation
 6.3     203.63       900167       tndri   evaluate driver model
 5.1     164.07       658140       tnloa   evaluate load model
 4.7     152.62       792936       prelm   process an element in the FOL
 1.6      51.92         4001       prfot   process a FOL in the time queue
 1.3      40.23       276985       adsfo   add a FOL to the time queue
 1.2      37.53        24046       dropf   drop a FOL from the time queue
 0.3       9.70        20547       prout   print out a node

It  is  clear  that  most  of  the  time  is  spent  processing  nodes  and  evaluating 
transistor  models  for  the  SOR-Newton  iteration.  Therefore,  any  speed-up 
techniques  should  be  applied  to  these  areas  of  the  program.  It  is  expected 
that,  as  the  program  is  developed  further,  information  provided  by  the 
profiler  can  be  used  to  significantly  reduce  execution  time,  particularly  for 
electrical  simulation.  Other  methods,  currently  available  to  the  user  to 
reduce  the  total  run-time,  are  described  in  the  next  section. 

4.3.  Factors  Affecting  Execution  Time  in  Electrical  Simulation 

4.3.1.  CPU-time vs. MRT


SPLICE1.7 has a user-specified fixed minimum timestep called the mrt
(Minimum Resolvable Time). The symbol h will be used to refer to the
timestep associated with a particular node. In SPLICE1, h is some integer
multiple of mrt. Although the minimum timestep is fixed, the value of h for
a specific node is dependent on the activity at that node. For example, when
the node is active, h is equal to mrt. Otherwise, h is defined by the time
difference between two events at the node. In a sense, the timestep at a
node is determined implicitly by the activity in the circuit.

Since there is no explicit timestep control mechanism in SPLICE1, the
CPU-time required to perform electrical simulation is a strong function of
mrt. Fig. 4.1 illustrates this relationship for the CDE circuit. There is an
optimum value of mrt for this particular circuit at 1 ns. At values of mrt
greater than the optimum, the program iterates longer to produce a solution
at a particular timepoint. This is due to the fact that a linear prediction
generates a poor guess as the timestep is increased and the diagonal terms of
the conductance matrix are reduced, thereby weakening the diagonal
dominance property of the matrix. In fact, at a very large value of mrt the
program may not converge at all.

At mrt values less than the optimum, the program is forced to do more
work during the active periods than is really necessary, based on the time
constants in the circuit. Therefore, there is a rapid rise in the curve in Fig.
4.1 below the optimum. It has been observed that at very small timesteps,
the curve begins to level off. This is probably due to the fact that the
prediction step is very accurate and only one or two iterations are required for
convergence.

The relation between CPU-time and mrt suggests that, for a given
technology, the optimum value should be obtained through experiment and used
in all further simulations. Of course, if an explicit dynamic timestep control
mechanism is implemented, this would not be necessary.

4.3.2.  CPU-time vs. MINDVSCH

In SPLICE1, events at a node are propagated to its fanouts if the change
in the node voltage is considered to be significant. If the change is not
significant, the fanouts are not scheduled and the node is returned to its
original value. This constitutes the event-driven selective-trace feature in
SPLICE1. The scheduling threshold parameter, used to determine whether or
not the change is significant, is called mindvsch. It is specified in units of
volts, by the user, for an entire simulation and is the same for every node in
the circuit.

The value chosen for mindvsch has a profound effect on the simulation
results. Careful consideration must be given to select an appropriate value
for  this  parameter.  It  can  be  thought  of  as  the  minimum  voltage  change 
which  can  affect  the  fanout  elements  of  a  node.  Based  on  experience  with 
the  program,  an  appropriate  value  for  most  digital  circuits  is  1  mV.  It  is 
much  smaller  for  analog  circuits,  particularly  if  there  are  high-gain  stages. 

The effect of mindvsch on CPU-time is quite dramatic as shown in Fig.
4.2. As mindvsch is increased, the CPU-time goes down. This suggests that
the value should be made as large as possible. Unfortunately, some
significant events may be accidentally dropped if mindvsch is too large,
resulting in a loss of accuracy. Also, node voltages can only reach a value
which is within mindvsch of their final value because the remaining voltage
change is not considered significant. Hence, if mindvsch is too large, there
will be errors at the end of each transition.
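
The trade-off between work and accuracy can be illustrated with a small Python sketch. It is a toy model, not the SPLICE1 scheduler: `update_node` and `settle` are hypothetical names, the comparison `>=` is an assumed definition of "significant", and the node is simply moved a fixed fraction (`gain`) of the way toward its final value at each evaluation.

```python
def update_node(old_voltage, new_voltage, mindvsch):
    """Return (accepted_voltage, schedule_fanouts).

    A change smaller than mindvsch is discarded: the node returns to its
    original value and the fanouts are not scheduled.
    """
    if abs(new_voltage - old_voltage) >= mindvsch:
        return new_voltage, True
    return old_voltage, False


def settle(target, mindvsch, gain=0.5, v=0.0, max_steps=1000):
    """Evaluate a node repeatedly until a change is judged insignificant."""
    steps = 0
    while steps < max_steps:
        candidate = v + gain * (target - v)  # toy model of one evaluation
        v, scheduled = update_node(v, candidate, mindvsch)
        if not scheduled:
            break
        steps += 1
    return v, steps
```

With a 5 V target, a threshold of 0.5 V stops after far fewer accepted events than a threshold of 1 mV, but leaves the node further from its final value, which is the residual error at the end of each transition described above.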

4.3.3.  Effect  of  Floating  Capacitors 


As indicated in Chap. 2, floating capacitors no longer pose a problem to
the electrical analysis in terms of accuracy but tend to degrade the
simulator performance. The factor which determines the amount of degradation is
the ratio of the floating capacitor to the grounded capacitor. In order to
illustrate the relationship between CPU-time and capacitance ratio, the
boot-strapped inverter of Fig. 2.4(b) was simulated with different values of
C_float/C_gnd. The results have been plotted in Fig. 4.3. It is clear from this graph
that the relationship obeys a square-root law.

Although the graph indicates that solutions may be obtained regardless
of the ratio, this is not true in general. If the ratio is too large,
the iteration may not converge. Since nodes with floating capacitors are
common, special techniques must be used to reduce the number of iterations
required to solve them. Research is underway to find ways to
accomplish this.

4.3.4.  CPU-time  vs.  SOR 

The sor parameter was introduced in Chap. 2 as an acceleration
parameter for the nonlinear Gauss-Seidel iteration. This parameter has a significant
effect on the CPU-time but does not affect simulation accuracy. Fig. 4.4
shows the relationship between CPU-time and sor obtained from simulations
performed on the digital filter circuit. There is an optimum value of sor
which minimizes the run-time of the simulation. In this case, the optimum
value is 0.8.

Although the optimum value changes from technology to technology, it
is worthwhile to obtain the value experimentally as it may provide a
substantial improvement over the standard Gauss-Seidel iteration.
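
The role of the relaxation parameter can be sketched on a small linear system. The Python below is a simplification: SPLICE1.7 applies the parameter to a nonlinear SOR-Newton iteration, whereas this example uses a linear Gauss-Seidel sweep, and the function name `sor_solve` and its tolerance are assumptions.

```python
def sor_solve(a, b, sor=1.0, tol=1e-9, max_iter=10_000):
    """Solve a x = b by Gauss-Seidel sweeps with relaxation factor `sor`.

    sor = 1.0 is the standard Gauss-Seidel iteration. Returns the solution
    and the number of sweeps needed to meet the tolerance.
    """
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iter + 1):
        largest_change = 0.0
        for i in range(n):
            sigma = sum(a[i][j] * x[j] for j in range(n) if j != i)
            gauss_seidel_value = (b[i] - sigma) / a[i][i]
            # Move only a fraction `sor` of the way to the Gauss-Seidel value.
            new_value = x[i] + sor * (gauss_seidel_value - x[i])
            largest_change = max(largest_change, abs(new_value - x[i]))
            x[i] = new_value
        if largest_change < tol:
            return x, iteration
    raise RuntimeError("SOR iteration did not converge")
```

Running the solver over a range of `sor` values and recording the sweep count is one way to find an optimum experimentally, in the spirit of Fig. 4.4. The converged answer is the same for every value that converges, which is why the parameter affects run-time but not accuracy.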

4.4.  SPICE2 vs. SPLICE1.7

Five circuits were simulated at the electrical level using SPICE2 and
SPLICE1.7 to compare run-times, memory requirements and accuracy.
These simulations were performed on a VAX-11/780 under the UNIX operating
system. In both simulators, default parameter values were used for the
convergence criteria, integration method, and accuracy tolerance. The
values are available in the user guide for each program.

4.4.1.  CDE  Circuit 

A block diagram of the CDE circuit is shown in Fig. 4.5. The details of
the blocks are given in [33]. This circuit is large and highly unidirectional in
nature. A 4-bit counter provides a sequence of inputs to the 4:16 decoder,
the output of which is encoded to 4 bits. The simulation period was 4 μs with
an mrt of 1 ns. As shown in Fig. 4.6, the output waveforms have long latent
periods. For this circuit, SPLICE1.7 was 66 times faster than SPICE2 and its
memory usage was 35 times smaller, for comparable accuracy. More
importantly, the SPICE2 simulation required 32 hours whereas the SPLICE1
simulation required only 40 minutes! This represents a substantial improvement in
speed and makes a simulation of this magnitude quite feasible. A small
fraction of this speed advantage can be attributed to the fact that SPLICE1
has been tailored for transient analysis, but the key reason for the
improvement is the efficient exploitation of latency.

The simulation results are summarized in Table 4.1. Also included in
this table are the simulation results obtained using SPLICE1.3, an earlier
version of SPLICE1 which used NTA. It is interesting to note that SPLICE1.7
required only twice as much time as SPLICE1.3 to produce a solution even
though it iterates to convergence. SPLICE1.3 uses only one SOR-Newton
iteration at each timepoint, but it can reject the new solution if the change
in voltage over an mrt is considered to be too large, as determined by a
parameter called "maxdvstep". This is done to maintain simulation
accuracy. For example, if the voltage change between times t_{n-1} and t_n is
considered to be too large, the program would cut the timestep by a factor of 4
and perform another series of single iterations at timepoints
t_{n-1} + h/4, t_{n-1} + h/2, t_{n-1} + 3h/4 and t_n. Further timestep cutting would be done if the
voltage change was deemed to be too large over any subinterval. Therefore,
many evaluations may be performed to produce the solution at a timepoint,
although a single iteration is always used at any given point in time. The
attempt to maintain accuracy in this manner increases the overall
simulation time. Furthermore, on the first iteration at a timepoint, SPLICE1.7 uses
previous history to predict a new voltage at a node, thereby reducing the
total number of iterations required to converge to a solution.

Fig. 4.5: Block Diagram of the CDE circuit

Fig. 4.6: Output waveforms of CDE circuit. The waveforms
feature long latent periods which is an advantage for ITA.
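
The timestep cutting described above can be sketched recursively in Python. This is an illustration only: `accept_step` is a hypothetical name, and `advance(v, t, h)` is a stand-in supplied by the caller for the single SOR-Newton iteration that SPLICE1.3 performs at a timepoint.

```python
def accept_step(v_prev, t_prev, h, advance, maxdvstep, factor=4):
    """Advance a node voltage over one step of size h, cutting if needed.

    Returns (voltage at t_prev + h, number of evaluations performed).
    If the change over the step exceeds maxdvstep, the step is rejected,
    divided by `factor`, and each subinterval is treated the same way.
    """
    v_new = advance(v_prev, t_prev, h)
    if abs(v_new - v_prev) <= maxdvstep:
        return v_new, 1
    evaluations = 1  # the rejected attempt still cost one evaluation
    v, sub_h = v_prev, h / factor
    for k in range(factor):
        v, used = accept_step(v, t_prev + k * sub_h, sub_h, advance,
                              maxdvstep, factor)
        evaluations += used
    return v, evaluations
```

With a toy model such as `advance = lambda v, t, h: v + h * (5.0 - v)`, a step whose change exceeds `maxdvstep` costs five evaluations instead of one, which shows how the single-iteration scheme can still perform many evaluations per timepoint. The sketch assumes the change produced by `advance` shrinks as h shrinks; otherwise the recursion would not terminate.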


Simulator     Time (s)     Memory (Kbyte)

SPICE2G       115,840      2,420
SPLICE1.7       1,740         69.9
Ratios             66         35
SPLICE1.3         843         68.9

Table 4.1

Comparison of conventional circuit simulation
and ITA for the CDE circuit.


One of the pulses in Fig. 4.6 has been magnified in Fig. 4.7 to compare
the accuracy of the three approaches used to simulate the CDE circuit, as
indicated in Table 4.1. The pulses from SPLICE1.7 and SPICE2 are centered
at 1627ns and 1629ns respectively. This difference would be
indistinguishable at the level shown in Fig. 4.6 and it is not clear which is the more
accurate solution. In this case, the true solution lies between the SPLICE1.7 and
SPICE2 results. The output of SPLICE1.3 features some numerical noise in
the vicinity of 0.5 volts which can be attributed to the single SOR-Newton
iteration used in NTA. The pulse generated by SPLICE1.3 is approximately
correct in its size and shape but is centered incorrectly at 1608ns, an error
of 20ns.

Fig. 4.7: The pulse shown in Fig. 4.6 has been magnified here
to compare ITA with NTA and SPICE2.

4.4.2.  Digital Filter Circuit

The block diagram of the digital filter is given in Fig. 4.8. Further details
may be found in [1]. The circuit was simulated using SPLICE1.7 and required
1783 seconds. The original SPLICE1 program, as described in [1], required
453 seconds which is 4 times faster. Clearly, the cost of ITA vs. NTA depends
on the size and nature of the circuit, but the key point is guaranteed
accuracy in the solution produced by ITA. For this circuit, SPLICE1.7 was 17
times faster and its memory usage was 21 times smaller than SPICE2. The
improvement is smaller than for the CDE circuit because this circuit is
somewhat smaller and has much more activity.


Circuit           Mosfets    Nodes

Digital Filter    705        393

Simulator     Time (s)     Memory (Kbyte)

SPICE2G        30,582      1,038
SPLICE1.7       1,783         48.4
Ratios           17.1         21.4

Table 4.2

Comparison of conventional circuit simulation
and ITA for the digital filter circuit.

4.4.3.  Industrial μP Control Circuit


The schematic for this circuit is shown in Fig. 4.9. It contained over 100
transistors and 100 diodes and is representative of a typical simulation
performed using the SPICE2 program. Although no extra elements were added
to this circuit, there was a capacitance to ground at each node of at least
10 fF in value. The SPLICE1.7 simulation required 4 min. while the SPICE2
simulation required 24 min. The memory requirements were 8 times less for
the SPLICE1 job. The output waveforms are shown in Fig. 4.10.


Circuit               Mosfets    Diodes    Nodes

μP Control Circuit    116        116       66

Simulator     Time (s)     Memory (Kbyte)

SPICE2G        1426.8      205.9
SPLICE1.7       177.2       26.2
Ratios              8          8

Table 4.3

Comparison of conventional circuit simulation
and ITA for an Industrial μP Control Circuit


4.4.4.  Industrial 64K CMOS Static RAM Circuit

The block diagram for this example is given in Fig. 4.11 and each
subcircuit schematic is shown in Fig. 4.12. This circuit contained over 300
transistors and is an example of an industrial circuit which would be very expensive
to simulate using SPICE2. The circuit contained only 36 explicit grounded
capacitors out of 151 nodes. Diodes were used on the remainder of the nodes
to model the parasitic junction capacitance effects. This was a sufficient
condition to obtain convergence at every timepoint.

In this case, SPICE2 required approximately 3 hours to produce a
solution whereas SPLICE1.7 required only 10 minutes. A comparison of the
output waveforms is given in Fig. 4.13. The results are very close in all cases
except in a few instances where SPICE2 exhibits point-to-point ringing. This
is a product of the trapezoidal integration method used by default in SPICE2,
which allows it to take larger timesteps but may cause ringing if the timestep
is too large. SPLICE1.7 uses a Backward-Euler integration scheme, and for
this method no numerical ringing is present in the waveforms.
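
The difference between the two integration methods can be seen on the simplest possible test problem, a capacitor discharging through a resistor, dv/dt = -v/tau. The Python below is a sketch for that test equation only and is not taken from either simulator; the function names are hypothetical.

```python
def backward_euler(v0, tau, h, steps):
    """Backward-Euler solution of dv/dt = -v/tau with a fixed timestep h."""
    values = [v0]
    for _ in range(steps):
        values.append(values[-1] / (1.0 + h / tau))
    return values


def trapezoidal(v0, tau, h, steps):
    """Trapezoidal solution of dv/dt = -v/tau with a fixed timestep h."""
    ratio = (1.0 - h / (2.0 * tau)) / (1.0 + h / (2.0 * tau))
    values = [v0]
    for _ in range(steps):
        values.append(values[-1] * ratio)
    return values
```

For h larger than 2 tau the trapezoidal update factor is negative, so successive points alternate in sign around the true solution. This is the point-to-point ringing mentioned above. The Backward-Euler factor 1/(1 + h/tau) is always between 0 and 1, so its solution decays monotonically for any timestep.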


Circuit           Mosfets    Diodes    Nodes

64K CMOS SRAM     344        277       151

Simulator     Time (s)     Memory (Kbyte)

SPICE2G        10448       506.3
SPLICE1.7        623        49.9
Ratios         16.75        10

Table 4.4

Comparison of conventional circuit simulation
and ITA for an Industrial CMOS 64K SRAM


4.4.5.  NMOS OpAmp Example

Although ITA was developed for the simulation of large digital circuits,
the algorithm is robust enough to accurately simulate complex analog
circuits. Therefore, an integrated circuit consisting mainly of digital circuitry
along with a few analog blocks, typically found in telecommunication circuits
and memory chips, can be simulated without any special precautions, other
than the usual requirement of some grounded capacitance at every node.

Fig. 4.13: Output waveforms of 64K RAM circuit from SPLICE1
and SPICE2. Note the ringing produced in the SPICE

To illustrate the capability of the ITA method, the OpAmp in Fig. 4.14 was
simulated using SPLICE1.7 and SPICE2. All the parasitic capacitances
associated with each transistor were fully represented. As shown in the schematic
diagram, there is a large 10 pF compensation capacitor providing a capacitive
feedback path in the circuit. The transistor at the output is to provide
high gain at the output node. The circuit was connected in a unity-gain
configuration and a step voltage was applied at the input. Fig. 4.15 is a
comparison of the outputs of SPLICE1.7 and SPICE2. The results are identical
except in the neighborhood of time t=0 due to slightly different initial
conditions assumed by each program. However, SPICE2 executed twice as fast
as SPLICE1.7 because of the size and nature of the circuit.
It is expected that this difference will be reduced as the program is
developed further.

In general, ITA may be slower than SPICE2 when simulating small analog
circuits  because  : 

•  they  usually  contain  large  feedback  paths  and  high  gain 

•  there  is  little  or  no  latency  in  a  typical  analog  circuit 

Therefore,  the  accuracy  tolerances  for  the  simulation  (abstol,  reltol) 
must  be  tight  and  the  scheduling  threshold,  mindvsch,  must  be  very  small. 
There  may  also  be  a  requirement  for  a  small  simulation  timestep  to  guaran¬ 
tee  convergence.  Therefore,  a  dynamic  timestep  control  mechanism  is 
essential  for  the  simulation  of  mixed  analog/digital  circuits  to  ensure  that 
the  global  timestep  will  not  be  over  constrained  by  the  analog  circuits.  Each 
node  could  have  a  local  timestep  which  is  based  on  its  own  Local  Truncation 
Error estimation. Another useful feature would be to allow parameters such
as abstol, reltol and mindvsch to be specified on a per-node basis. In this
way, certain nodes would be forced to iterate longer than others to ensure
accuracy at these nodes. These and other techniques may be used to
improve the performance of the simulator for handling analog circuits,
although ITA is not ideally suited to the task.
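
The per-node tolerance idea can be illustrated with a short Python sketch. It
is not taken from SPLICE1: the record type, the default values and the form of
the test (the usual abstol + reltol criterion used by SPICE-like programs) are
assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class NodeTolerance:
    """Hypothetical per-node tolerance record; field names follow the text."""
    abstol: float = 1e-6   # absolute voltage tolerance in volts (assumed default)
    reltol: float = 1e-3   # relative tolerance (assumed default)


def node_converged(v_new: float, v_old: float, tol: NodeTolerance) -> bool:
    """Accept the iterate when the change is within abstol + reltol * magnitude."""
    limit = tol.abstol + tol.reltol * max(abs(v_new), abs(v_old))
    return abs(v_new - v_old) <= limit


# A tight tolerance on an analog node forces more iterations there,
# while a loose tolerance lets a digital node stop early.
analog = NodeTolerance(abstol=1e-9, reltol=1e-6)
digital = NodeTolerance(abstol=1e-3, reltol=1e-2)
```

With these example settings, a change from 2.5 V to 2.5005 V is accepted at
the digital node but would trigger another iteration at the analog node.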

4.4.6.  CPU-time  vs.  Circuit  Size 

The  run-times  for  the  circuits  described  in  this  section  are  plotted 
against the circuit size in Fig. 4.16. It is clear from this plot that SPLICE1.7
is much faster than the SPICE2 program for large circuits. In fact, as the
circuit size increases, the improvement factor increases. This is due to the
fact that the linear equation solution time in SPICE2 increases rapidly with
the circuit size, as described in Chap. 2. The run-time in SPLICE1.7 is
proportional to the activity in the circuit and the mrt rather than circuit
size. If the circuit is small, the standard approach is usually more efficient.

4.5.  Mixed-Mode  Examples 

To complete this section, a pair of examples is presented using a
logic/switch combination and a logic/electrical combination. The circuit to
be simulated is a CMOS 4x5 multiplier with a 4-bit counter to generate a test
sequence, as shown in block form in Figs. 4.17(a) and 4.17(b). The entire
circuit is shown in Fig. 4.18. The multiplier uses the novel adder circuit
illustrated in Fig. 4.19(a) and the exclusive-OR circuit of Fig. 4.19(b). It
is clear from this figure that the adder would be difficult to represent
using Boolean gates. Therefore, a switch-level description is appropriate.
For the simulation, the multiplier was connected in a multiply-by-2
configuration by setting the B-bits to 2. The counter, shown earlier in
conjunction with the CDE circuit, was represented at the logic level using
Boolean gates. It provided a test sequence which was applied to the A-bits.

Fig. 4.16 : A plot of the results obtained using SPLICE1 and SPICE.

Initially,  the  operation  of  the  multiplier  was  verified  at  the  switch-level. 
This required only 7.9 CPU-seconds, including the simulation of the counter
circuit. The output of this simulation is given in Fig. 4.20(a). Note that the
outputs, p0, p1, p2 and p3, are evaluated at the input edges, since the
lator  is  operating  in  zero-delay  mode  at  the  switch-level. 

After  debugging  the  circuit  at  the  switch-level,  the  description  of  the 
multiplier  was  changed  from  a  switch-level  description  to  an  electrical-level 
description  by  simply  changing  the  underlying  models  associated  with  the 
transistors.  The  majority  of  the  description  was  left  unchanged.  The  counter 
circuit  description  at  the  logic  level  was  left  in  the  circuit  to  generate  the 
test  inputs  for  the  A-bits,  as  before.  Logic-to-Voltage  converters  were 
inserted  where  necessary.  This  simulation  required  682.7  seconds.  The  out¬ 
put  of  the  simulation  is  shown  in  Fig.  4.20(b). 

Some useful ways to use the mixed-mode capability in SPLICE1 have
been  illustrated  in  this  example: 

(1)  One  can  debug  the  circuit  very  efficiently  using  zero-delay  switch-level 
simulation.  Then,  a  more  detailed  simulation  can  be  performed  to 
determine  exact  delays  at  the  electrical  level  with  very  few  changes  in 
the  circuit  description,  if  the  design  is  described  hierarchically. 

(2)  A  complicated  logic  circuit  can  be  used  to  generate  test  inputs  for  an 
electrical  simulation  as  opposed  to  using  logic  sources  as  input.  In  fact, 
a  master  clock  signal  was  the  only  input  waveform  for  the  mixed-mode 
simulations described here.

(3) A complicated circuit can be decomposed into small blocks and each
block can be simulated with ITA electrical simulation. Once each block
has  been  checked,  a  switch-level  model  can  be  generated  which  matches 
the  logic  characteristics  of  the  cells.  These  blocks  can  then  be  combined 
for  a  switch-level  analysis  of  the  entire  circuit  which  is  relatively  inex¬ 
pensive  compared  to  electrical  simulation. 

The results of the simulations are summarized in the table below.


Circuit          Mosfets   Multiplier Nodes   Counter Gates   Counter Nodes
4x5 Multiplier   545       248                124             130


Simulation       Time (s)   Memory (Kbyte)
Switch-level     7.9        64.5
Electrical       882.7      88.3

CHAPTER 5


5. CONCLUSIONS


SPLICE1 has been greatly improved by incorporating the new techniques
described  in  this  report.  As  evidenced  by  the  statistics  in  Chap.  4,  the  new 
electrical  simulation  approach,  ITA,  is  substantially  faster  than  SPICE2  and 
requires  far  less  storage.  This  method  has  shown  so  much  promise  that 
efforts  are  underway  to  generalize  it  as  a  standard  circuit  simulation 
approach.  As  pointed  out  earlier,  the  major  problem  with  the  method  is  the 
number  of  iterations  required  to  obtain  a  solution  when  floating  capacitors 
are present in the circuit. As the prototype program is developed further, it
is  expected  that  the  performance  characteristics  will  be  significantly  better 
than  SPICE2.  The  ITA  method  provides  a  way  to  efficiently  simulate  large 
digital  circuits  and  it  may  replace  the  standard  approach  in  this  application. 
It  is  also  suitable  for  implementation  on  special-purpose  hardware  and  work 
is  underway  in  this  area.  Other  areas  of  future  work  include  the  extension  of 
the  method  to  use  Modified  Nodal  Analysis,  dynamic  timestep  control  and 
error  control  mechanisms. 

The  logic  analysis  in  SPLICE1.7  has  been  enhanced  to  perform  true-value 
logic  simulation  using  a  strength-oriented  MOS  model.  This  not  only  allows 
accurate  modeling  at  the  logic  level  but  also  provides  a  mechanism  to  per¬ 
form  accurate  mixed-mode  simulation.  There  is  still  work  to  be  done  in  the 
area  of  strength  modeling  for  logic  elements  to  define  the  electrical/logic 
interfaces more accurately. SPLICE1.7 handles logic transfer gates in a
consistent manner but the CNR method is not appropriate for a multiprocessor
architecture. There is also the issue of delay modeling at the switch-level
which has not been addressed here. Research is currently being directed at
applying multiple iterations at the logic level to determine state and delay
information in transistor-level logic circuits.

In  conclusion,  the  concepts  presented  in  this  report  suggest  that  con¬ 
sistent  electrical  and  logic  simulation  can  be  performed  at  the  transistor- 
level  using  relaxation-based  algorithms  and  event-driven  selective  trace 
techniques.


APPENDIX I


Input Files for Example Circuits


The example circuits may be obtained from the University of California at
Berkeley.


APPENDIX II


SPLICE1.S  Data  Structures 


(1) Nodes: The node data structure is set up in GENFS for the logic, electrical
and vrail nodes.


LOGIC NODE:

offset   abbrev.   definition
0        fol       fanout pointer
         fin       fanin pointer
         type      = 1 (for logic node)
                   = -1 (for logic output node)
         ts*       fanout schedule time
         lval      logic value (3 bits for current value b2b1b0,
                   3 bits for previous value b5b4b3;
                   1=0, 2=1, 3=X)
                   logic strength (16 bits for current value,
                   16 bits for previous value);
                   minimum strength = 1; maximum strength = 65,536
         modptr    1: capacitance at node
                   2: node decay delay value
         dectim    node decay time
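
The packing of the logic value field can be sketched in Python as follows.
The bit positions and the codes 1, 2 and 3 come from the table; the function
names are hypothetical and the code only illustrates the layout, it is not
the SPLICE1 implementation.

```python
from typing import Tuple

# Codes given in the node table: 1 = logic 0, 2 = logic 1, 3 = unknown (X).
LOGIC_0, LOGIC_1, LOGIC_X = 1, 2, 3


def pack_lval(current: int, previous: int) -> int:
    """Place the current code in bits b2b1b0 and the previous code in b5b4b3."""
    return ((previous & 0b111) << 3) | (current & 0b111)


def unpack_lval(lval: int) -> Tuple[int, int]:
    """Return the (current, previous) codes stored in a packed word."""
    return lval & 0b111, (lval >> 3) & 0b111


def update_lval(lval: int, new_value: int) -> int:
    """Move the current code into the previous field and store the new code."""
    current, _ = unpack_lval(lval)
    return pack_lval(new_value, current)


# A node that starts unknown and is then driven to logic 1.
word = update_lval(pack_lval(LOGIC_X, LOGIC_X), LOGIC_1)
```

Keeping both values in one word lets an event-driven evaluator compare the
current and previous state of a node with a single memory access.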


ELECTRICAL NODE:

offset   abbrev.   definition
                   fanout pointer
         fin       fanin pointer
         type      = 2 (for electrical node)
                   = -2 (for electrical output node)
         ts-       fanout schedule time (last time or next time)
         Vn-1      current node voltage
         Vn-2      previous node voltage
         capptrs   points to node capacitance values in rvals
                   last time processed (associated with Vn-1)
                   previous time processed (associated with Vn-2)


VRAIL NODE:

offset   abbrev.   definition
0                  (not used)
1                  (not used)
2        type      = 5 for a vrail node
3        vn        current node voltage = constant
                   previous node voltage = constant


INTEGER information is typically accessed using the nodptr array,

i.e. info = imem(nodptr+locnod+ipos)

imem   : integer memory maintained by memory manager
nodptr : node information data structure origin
locnod : position of 1st piece of info for node
ipos   : position of desired info

REAL information is accessed through one more level of indirection,

i.e. capacitance = rmem(rvals+imem(nodptr+locnod+5))

rmem   : real memory maintained by memory manager
rvals  : origin of real value array
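
The two access patterns can be mimicked with ordinary Python lists. This is
only a sketch: the arrays are zero-based here, whereas the original program
uses its own memory manager, and every number in the toy memory image is made
up for illustration.

```python
def int_info(imem, nodptr, locnod, ipos):
    """Integer field of a node: imem(nodptr + locnod + ipos)."""
    return imem[nodptr + locnod + ipos]


def node_capacitance(imem, rmem, nodptr, rvals, locnod):
    """Real field of a node: offset 5 holds an index into the rvals array."""
    return rmem[rvals + imem[nodptr + locnod + 5]]


# Toy memory image (all values are illustrative assumptions).
imem = [0] * 16
rmem = [0.0] * 8
nodptr, rvals = 4, 2             # origins of the node table and of rvals
locnod = 0                       # position of the first node
imem[nodptr + locnod + 2] = 2    # type = 2 (electrical node)
imem[nodptr + locnod + 5] = 3    # index of the capacitance within rvals
rmem[rvals + 3] = 1.5e-13        # capacitance value in farads
```

The extra level of indirection keeps integer and real data in separate
arrays, so one node record can refer to real values stored elsewhere.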

(2) Fanin and Fanout Lists: Fanin and fanout lists are stored with the node data
structure.  Fanins  to  a  node  are  all  elements  which  can  affect  the  value  of 
the  node.  Fanouts  of  a  node  are  all  elements  which  can  be  affected  by  a 
new  value  at  the  node.  They  are  set  up  in  the  LCGFA,  TIMFA  and  ENDFA  sub¬ 
routines. 


locfol:   0   scheduler link
          1   element 1 ptr
          2   element 2 ptr
          3   element 3 ptr
          .
          n   element n ptr


If there is only one element in the fanin list (which is often the case), then
this list does not exist. The fll pointer in the node data structure has a -ve
sign to denote that it is the element pointer itself.


locfll:   0   unused location
          1   element 1 ptr
          2   element 2 ptr
          3   element 3 ptr
          .
          n   element n ptr

(3) Models: SPLICE1 stores model information using two levels of indirection so
that one model may be referenced by many elements.

Model info pointers are stored in an array called mdmptr:


mdmptr:   0   locmod 0
          1   locmod 1
          2   locmod 2
          3   locmod 3
          .
          n   locmod n


locmod points into a table called modptr which is organized as follows:


modptr:   0   modtyp 1   (model type)
          1   locpar 1   (location of parameters)
          2   modtyp 2
          3   locpar 2
          4   modtyp 3
          5   locpar 3


locpar  points  into  rvals  which  is  an  array  of  floating-point  quantities  and  so 
parameters  are  accessed  as  follows: 


parameter  =  rmem  (  rvals  +  locpar) 

The  rvals  array  is  just  a  set  of  real  values  in  the  rmem  space. 


rvals:    0   rvalue 0
          1   rvalue 1
          2   rvalue 2
          3   rvalue 3
          .
          n   rvalue n
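
The chain mdmptr -> modptr -> rvals can be followed in a few lines of Python.
This is a sketch under stated assumptions: the arrays are zero-based lists,
the `nparams` argument is hypothetical (the text does not say how the number
of parameters of a model is known), and the data are invented.

```python
def model_parameters(modnum, mdmptr, modptr, rmem, rvals, nparams):
    """Return (model type, parameter list) for model number modnum."""
    locmod = mdmptr[modnum]          # first level: position within modptr
    modtyp = modptr[locmod]          # model type
    locpar = modptr[locmod + 1]      # second level: position within rvals
    start = rvals + locpar
    return modtyp, rmem[start:start + nparams]


# Two models: type 7 with three parameters and type 9 with two parameters.
mdmptr = [0, 2]
modptr = [7, 0, 9, 3]
rvals = 2                            # origin of rvals inside rmem
rmem = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

Because elements store only a model number, many elements can share one
parameter set without copying it.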


(4) Elements: Elements are initially written out to scratch files (timel, logel) by
the routine SAVEL. Once they are read back in, they are stored in the array
elmptr with the following format:


elmptr:   0            -modnum     (first logic element)
          1            noutputs    (number of outputs)
          2            node1
          3            node2
          4            node3
          .
          i            -modnum     (second logic element)
          i+1          noutputs    (number of outputs)
          i+2          node1
          i+3          node2
          .
          nlogwds+0                (last logic element node)
          nlogwds+1    -modnum     (first electrical element)
          nlogwds+2    noutputs    (number of outputs)
          nlogwds+3    node1
          nlogwds+4    node2
          .
          ntimwds+0                (last electrical node)


(5) Scheduler: The time queue is made up of two 100-word arrays and a pool for
any events which do not fall within 200 timepoints of the beginning of the
queue.

iscb1:   QUEUE 1   time offsets 0, 1, 2, ...

iscb2:   QUEUE 2   time offsets 0, 1, 2, ...

iscb3:   POOL      TIME 1
                   LOCFOL 1
                   TIME 2
                   LOCFOL 2
                   TIME 3
                   LOCFOL 3
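
A time queue of this shape can be sketched in Python as two 100-slot arrays
plus an overflow pool. The class, its method names and the way the queues are
rotated are assumptions made for illustration; the text only states the sizes
of the arrays and the role of the pool.

```python
QUEUE_LEN = 100


class TimeQueue:
    """Two fixed-size time queues plus a pool for events further ahead."""

    def __init__(self):
        self.base = 0                     # timepoint of slot 0 of queue 1
        self.queues = [[[] for _ in range(QUEUE_LEN)] for _ in range(2)]
        self.pool = []                    # (time, locfol) pairs beyond the window

    def schedule(self, time, locfol):
        offset = time - self.base
        if offset < 0:
            raise ValueError("cannot schedule an event in the past")
        if offset < 2 * QUEUE_LEN:
            self.queues[offset // QUEUE_LEN][offset % QUEUE_LEN].append(locfol)
        else:
            self.pool.append((time, locfol))

    def events_at(self, time):
        offset = time - self.base
        if 0 <= offset < 2 * QUEUE_LEN:
            return self.queues[offset // QUEUE_LEN][offset % QUEUE_LEN]
        return [lf for t, lf in self.pool if t == time]

    def advance(self):
        """Retire queue 1, promote queue 2 and refill the window from the pool."""
        self.base += QUEUE_LEN
        self.queues = [self.queues[1], [[] for _ in range(QUEUE_LEN)]]
        waiting, self.pool = self.pool, []
        for time, locfol in waiting:
            self.schedule(time, locfol)


q = TimeQueue()
q.schedule(5, "a")
q.schedule(150, "b")
q.schedule(250, "c")        # more than 200 timepoints ahead: goes to the pool
```

Events within the 200-timepoint window are found by direct indexing, and
only the rare far-future events pay the cost of the pool.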


APPENDIX III


Electrical Element Model Equations


1. Resistors

$G_{eq} = \frac{1}{R}$

$I_{eq} = \frac{(V_1 - V_2)}{R}$


2. Floating Capacitors

$G_{eq} = \frac{C_{float}}{h}$

$I_{eq} = G_{eq} \left[ (V_1^{n} - V_1^{n-1}) - (V_2^{n} - V_2^{n-1}) \right]$
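
The resistor and floating capacitor models can be written out as a small
Python sketch. It assumes a backward-Euler reading of the capacitor equations
with timestep h; the function and argument names are hypothetical.

```python
def resistor_companion(r, v1, v2):
    """Equivalent conductance and current of a resistor between nodes 1 and 2."""
    g_eq = 1.0 / r
    i_eq = (v1 - v2) / r
    return g_eq, i_eq


def floating_cap_companion(c_float, h, v1_n, v1_prev, v2_n, v2_prev):
    """Backward-Euler model of a capacitor connected between two nodes.

    h is the timestep; *_n are voltages at the current timepoint and
    *_prev are voltages at the previous timepoint.
    """
    g_eq = c_float / h
    i_eq = g_eq * ((v1_n - v1_prev) - (v2_n - v2_prev))
    return g_eq, i_eq


# 1 kOhm resistor with 5 V and 3 V at its terminals.
g_r, i_r = resistor_companion(1000.0, 5.0, 3.0)

# 1 pF capacitor, 1 ns step, node 1 rises by 1 V while node 2 stays at 0.5 V.
g_c, i_c = floating_cap_companion(1e-12, 1e-9, 1.0, 0.0, 0.5, 0.5)
```

The capacitor current depends on the change of the voltage across it, so a
floating capacitor couples the updates of its two nodes.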


3.  Transistors 
a.  Triode  Region 


Drain  node 

G*  =  vCo,  ji(  V -  VT-~i  K* X+(  1.0+X  /*,(  -  Vjr 


Source  Node 

GW  =  MO-  j<(^-Kr+lW  ;^===^(1.0+XK*)+(7,,-^ 

b.  Saturation  Region 


^(i.o+x7dt)(ygl-yr)g


APPENDIX IV


Source Code for SPLICE1


The SPLICE1 program is available in the public domain from the University of
California at Berkeley.


BLOSSOM: An Algorithm and Architecture for the
Solution of Large-Scale Linear System


Ko and

Alberto Sangiovanni-Vincentelli


Department  of  Electrical  Engineering  and  Computer  Science 
U.C.  Berkeley 


Abstract 

Block LU factorization and other partitioned matrix algorithms
are used in a VLSI computing system, BLOSSOM, to
solve very large-scale matrix problems. With only one type
of VLSI arithmetic processing unit needed for submatrix
computations, a reconfigurable processor array is used.
Since numerical properties are vital to matrix computations,
we use neighbor pivoting on submatrices with scalar
elements and partial pivoting on the full matrix with
submatrices as elements. Natural topological structures of certain
kinds of operand matrices are also exploited to help reduce
pivoting costs. BLOSSOM is designed such that submatrices
of different sizes can be efficiently processed even
when the number of processing units is less than that of the
submatrix elements. Comparisons with other special
hardware schemes are provided as a reference.

1.  Introduction 

The solution of large-scale linear systems of algebraic
equations (LSE) is needed in the analysis and simulation of
many engineering systems. Finite-element analysis, circuit
simulation, and power system analysis are a few examples. In
these applications, sparse matrix techniques have been
used extensively to speed up the solution process. The time
complexity of these techniques has been experimentally
estimated to be $O(n^{\alpha})$, where $1.2 \le \alpha \le 1.5$ and n is the
number of equations. When analyzing very large scale
systems (more than 10,000 equations), the computational
complexity of these techniques makes the solution process
very expensive and the use of large mainframes such as the
IBM 3081 indispensable.
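
The effect of the exponent can be made concrete with a short Python
calculation. The cost model below keeps only the power-law term and drops the
constant factor, which is an assumption made for illustration; the text gives
only the experimentally estimated range of the exponent.

```python
def relative_cost(n: int, alpha: float) -> float:
    """Power-law cost model n**alpha with the constant factor dropped."""
    return float(n) ** alpha


for alpha in (1.2, 1.5):
    growth = relative_cost(20_000, alpha) / relative_cost(10_000, alpha)
    print(f"alpha = {alpha}: doubling n multiplies the cost by {growth:.2f}")
```

Under this model, doubling the number of equations multiplies the solution
time by a factor between 2^1.2 and 2^1.5, that is, by more than two in every
case.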

New architectures, in particular vector computers such
as the CRAY 1, have inspired the design of new algorithms to
exploit parallelism in the solution process. An important
example is the program CLASSIE [14] for the simulation of
electronic circuits. Along these lines, peripheral array
processors, such as the FPS164, can also be used in conjunction
with hosts such as the VAX 11/780 to speed up the solution
process. However, this speedup is not enough to cope with
the problems to be solved in the VLSI era.

The advent of VLSI technology has made the cost-effective
design of special purpose machines possible. Examples
of these machines are the Yorktown Simulation
Engine (YSE) for logic simulation [ ] and Systolic Arrays [11].
Special purpose machines have also been proposed for the
solution of LSE [7, 9, 5, 15]. Most of these machines limit the
size of the operand matrix. When no size limit is imposed,
the operand matrix has to be partitioned into submatrices
of equal sizes. Only Johnsson in [7] and Pottle in [8] treated
the related numerical properties, and matrix sparsity is
exploited only in [9]. However, special matrix structures,
such as the Bordered Block Diagonal Form (BBDF) or the
Bordered Block Triangular Form (BBTF), commonly expected
in engineering problems, are not exploited in [9]. In this
paper, we propose a new algorithm-architecture, BLOSSOM,
for the solution of LSE.


The paper is organized as follows. In Section 2 we propose
a variant of block LU decomposition with block neighbor
pivoting to ensure numerical stability and accuracy. A
parallel-pipeline architecture, described in Section 3, is
designed to implement the block LU decomposition. This
architecture supports other matrix operations used as
subprocedures by block LU decomposition such as the
multiplication and the inversion of submatrices. In Section 4, we
describe the hardware implementation of these matrix
operations. Finally, a comparison with other hardware
schemes is listed in Table 1.

2. Block LU Factorization


Let $A \in R^{N \times N}$ be partitioned into $n^2$ submatrices $A_{ij}$ as
shown in Fig. 1, where $1 \le i \le n$ and $1 \le j \le n$.
$L_{ij}$ and $U_{ji}$ denote respectively the submatrices of the
block L, U factors of A in the (i,j) and (j,i) positions. The
following algorithm performs block LU decomposition:

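Algorithm 1.1. Block LU-decomposition:

(1) $i = 1$;

(2) Compute $L_{ii} = A_{ii} - \sum_{j=1}^{i-1} L_{ij} U_{ji}$;

(3) Compute the inverse of $L_{ii}$, denoted by $L_{ii}^{-1}$;

(4) If $i = n$, stop.

(5) Compute for all $k$, $i < k \le n$:

$L_{ki} = A_{ki} - \sum_{j=1}^{i-1} L_{kj} U_{ji}$;

$U_{ik} = L_{ii}^{-1} \left( A_{ik} - \sum_{j=1}^{i-1} L_{ij} U_{jk} \right)$;

(6) $i = i + 1$; go to (2);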

Note that the structure of this algorithm is the same as
Crout's method except that each step is based on submatrices
instead of scalar elements. Thus, the block L, U factors
are not triangular matrices but rather block triangular
matrices and the multipliers $L_{ii}^{-1}$ in step 3 are computed by
means of a matrix inversion instead of scalar inversion.
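
A Crout-ordered block LU decomposition can be sketched in plain Python as
follows. This is an illustration under stated assumptions, not the BLOSSOM
implementation: all blocks are square and of equal size, no block pivoting is
performed, the diagonal blocks of U are implicit identity matrices, and the
helper names are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matsub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def inverse(m):
    """Gauss-Jordan inverse with row pivoting; raises on a singular block."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if aug[p][c] == 0.0:
            raise ValueError("singular diagonal block")
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def block_lu(A):
    """A is an n x n grid of blocks. Returns (L, U); unused positions are None."""
    n = len(A)
    L = [[None] * n for _ in range(n)]
    U = [[None] * n for _ in range(n)]
    for i in range(n):
        # Block column i of L: L_ki = A_ki - sum_{j<i} L_kj U_ji, for k >= i.
        for k in range(i, n):
            acc = A[k][i]
            for j in range(i):
                acc = matsub(acc, matmul(L[k][j], U[j][i]))
            L[k][i] = acc
        lii_inv = inverse(L[i][i])
        # Block row i of U: U_ik = L_ii^{-1} (A_ik - sum_{j<i} L_ij U_jk), k > i.
        for k in range(i + 1, n):
            acc = A[i][k]
            for j in range(i):
                acc = matsub(acc, matmul(L[i][j], U[j][k]))
            U[i][k] = matmul(lii_inv, acc)
    return L, U


# A 4 x 4 matrix partitioned into four 2 x 2 blocks.
A = [[[[4.0, 1.0], [1.0, 3.0]], [[1.0, 0.0], [0.0, 1.0]]],
     [[[1.0, 0.0], [0.0, 1.0]], [[5.0, 2.0], [2.0, 6.0]]]]
L, U = block_lu(A)
```

Every scalar division of the scalar Crout method becomes a block inversion
here, which is why the nonsingularity of each diagonal block matters.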

For the algorithm to be well-defined, it is necessary
that $L_{ii}$ be nonsingular. George [5] computed the block LU
factorization under the condition that the operand matrix is
diagonally dominant. However, Bunch [1] pointed out that
diagonal dominance of A is not sufficient to guarantee that
the algorithm is well defined. Bunch conjectured that
block diagonal dominance is a sufficient condition. The
following theorem states that strict block diagonal dominance
guarantees the nonsingularity of the diagonal submatrices
of each reduced matrix of A, which implies the
existence of block L, U factors. The proof can be found in
[10].

Existence Theorem: Let A be a nonsingular NxN matrix
partitioned as in Fig. 1. If A is strictly block row diagonally
dominant, then each reduced matrix of A in block Gaussian
elimination is strictly block (row) diagonally dominant.
Remark: The uniqueness of block LU factors is proved in
[10].

Sparsity has been exploited extensively in scalar LSE
solution algorithms to reduce computational complexity.
This concept is used block-wise in BLOSSOM. Operations
performed by BLOSSOM involve nonzero blocks only. To exploit
the sparsity further, concurrent subtasks of block LU
decomposition on matrices of special topological structures
such as BBDF and BBTF can be executed in parallel by BLOSSOM.
These subtasks are, for example, inversion of diagonal
submatrices on the block diagonal of a matrix in BBDF,
forming of L factors on the i'th block column, etc.

Dahlquist et al. [2] indicated that a compact scalar Gaussian
elimination algorithm cannot incorporate complete or
diagonal pivoting easily because of the order in which LU
factors are computed. This limitation also applies to block
LU decomposition. When partial pivoting is used in block LU
decomposition, square blocks in the same block column of
the present pivot are checked to find the one with
smallest-norm inverse. We add partial pivoting to algorithm 1.1 as
follows: after the $L_{ki}$'s are computed (including $L_{ii}$), compute their
inverses. Compare the norms of these inverses to determine
a new pivot, and then compute the $U_{ik}$'s. The hardware
implementation of block LU decomposition is thus based on
algorithm 1.1 with partial pivoting added.

3. System Architecture

The BLOSSOM system consists of a host-interface, an
executive control unit (ECU), a sequencer, and a
reconfigurable processor array as depicted in Fig. 2. The
following is a behavioral description of each functional block.

3.1. Host-Interface

The host machine acts as an intelligent interface
between the user and BLOSSOM. A request for an LSE solution
issued by the user is sent to BLOSSOM by the host machine.
If BLOSSOM is not occupied, data will be loaded into
BLOSSOM, freeing the host for other duties.

BLOSSOM is intended to be capable of being attached to
different kinds of host. For this reason, the host-interface
consists of two parts. The first part is dedicated to
accommodating different host machines by providing facilities such
as converting integer numbers into floating point numbers
and matching different word lengths. The second part
initiates block LU decomposition as described in the sequel.

The data path and control path are generally kept
separate in most processor designs. In BLOSSOM, the
operation set is very small and the amount of data is very large.
Hence control signals are embedded into the data stream.
Since the host-interface recognizes only partitioned
matrices and vectors, the host provides a data separator for
each submatrix and vector segment in the data stream. The
host-interface generates a proper data representation each
time a separator is encountered. This data representation
and the instruction words are then sent to the ECU, while data
are sent to the main memory.

3.2. Executive Control Unit (ECU)

The ECU decodes host-interface instructions on
matrices into sequencer instructions on submatrices and
vector segments. Thus each task requested by the host is
partitioned into several subtasks and each subtask is
carried out by the processor array under supervision of the
sequencer.

The set of sequencer instructions corresponding to
each host instruction is stored in a control memory of the ECU,
and the status of the sequencer is monitored by the ECU.
The ECU also generates control signals indicating whether
processors in adjacent rows (columns) should be connected.
We denote them by RECONr.i (RECONc.j).

The subtasks under execution by the sequencer are
monitored by the ECU. Whenever a subtask is completed,
the ECU can perform one of the following actions on the
next subtask on the queue: execute it immediately, wait
until more processors become available, or wait until all
processors become available. The choice among these
possible actions is determined by the sizes of the subtask and
the available processor subarrays.


3.3.  Sequencer 

The ECU generates concurrent requests for the
sequencer, which in turn initiates and monitors actions in
the processor array. The control logic in the sequencer is
then divided into four parts: the Task Control Unit (TCU),
the Memory Management Unit (MMU), the Sequencing Unit
(SU), and the Feedback Buffer (FB), as shown in Fig. 3.

3.3.1. Task Control Unit & Memory Management Unit

The TCU interfaces with the ECU and maintains the
status registers. Each instruction received from the ECU is
directly added (instead of decoded) into the data stream
that is loaded by the MMU and sequenced by the SU.

The MMU manages the main memory system. Since the
amount of data of each operand matrix can be very large,
a virtual addressing scheme is used. The operand address
received from the ECU is the starting virtual address of a
submatrix or vector segment. The MMU computes the
ending virtual address from it and translates both virtual
addresses to physical addresses. The MMU is required to
fetch data for concurrent tasks from the multiport memory.
The host-interface would try to allocate potentially parallel
data objects into different parts of the memory hierarchy to
be fetched through different ports. However, memory
conflicts are possible when pivoting is employed. If conflicts
happen, the MMU would inform the TCU. The latter can
either reject the subtask and return it to the ECU, or keep the
subtask in its queue till the conflict disappears.

3.3.2. Sequencing Unit & Feedback Buffer

The SU adds different delays to data loaded by the MMU
before sending them to different processor rows or columns
according to the requirements of the computing algorithms.
The computation of the number of these delays is done in
the SU under the control of the TCU. Since the delay is
uniform from processor column to column and from processor
row to row in each subtask, only one computation is needed
for each subtask. The computations for different subtasks
are done sequentially to limit the hardware complexity.

The FB is used when the data output from the
processor array is resent to the processor array again. To the SU,
the FB is a multiport buffer between it and the processor
array. The FB has the capability to reverse the existing
delay relationship among a row of data.

The processor array consists of processor elements,
local data links, local control links, SU data registers, FB
data registers, and the ECU RECON buses, as shown in Fig. 4.
The local data links carry multiplexed data words. There is
one ECU RECON bus per processor row and one per
processor column.

Each subprocedure used by block LU decomposition
has a corresponding microprogram stored in each
processor, which is activated when that subprocedure is initiated
by the sequencer. These subprocedures assume that all
segments of a row (or column) of an operand are stored in
one processor row (or column). A register file in each
processor is used as a queue to enable such a storage scheme.
Each processor has four data buses connecting to its
neighbors which are named "data ports". The processors also
make individual decisions on whether connections to
neighboring processors are maintained or not by invoking proper
microprograms.

4. Partitioned Matrix Operation

The procedures employed by block LU decomposition
may use any or all forms of matrix, submatrix, vector,
vector segment, and scalar for their operands. A procedure is
implemented by a string of commands issued by its
controlling module, and is labeled by its controlling module. For
example, block LU decomposition is ECU controlled. Due to
the space limitation in this paper, only submatrix
inversion is described in detail here. Other operations are
listed for reference only.

4.1. Pipelined Inversion of a Submatrix

Let $P \in R^{p \times p}$ be the operand matrix and $I \in R^{p \times p}$ be the
unity matrix. The pipelined inversion is essentially Gaussian
elimination applied to the augmented matrix $[P \; I]$ resulting
in $[I \; P^{-1}]$. The procedure is listed below:

I 


fory*l  top  { 

if  is; 


for  all  s  =  l  fop: 


then  P^-P^-P^Pjt,  tor  all  k. i  =  l  to 
ft*  =/<i for  all  k, i  =  l  fop; 


P: 


Note  that  B„=[P. y  |  Im].  and  P„.  /„  are  scalars 

Two important features are added to the hardware
implementation of this procedure: "folding of operand
submatrix" and "neighbor pivoting". The inverting process is
partitioned into two phases. The first phase is
triangularization with neighbor pivoting. Backward substitution is the
second. The two phases are executed as two contiguous
steps following a one-time loading of data.

When neighbor pivoting is used, the leading element of
the next adjacent row is reduced to zero by adding a
multiple of the current row. When this multiplying factor exceeds
unity, the former row becomes the current row and the
latter has its leading element reduced to zero. Thus this
factor is kept fractional, which suggests that this
triangularization process is numerically stable, as confirmed by
empirical results [15, 13].
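
Neighbor pivoting can be illustrated with a sequential Python sketch. It
eliminates only between adjacent rows and swaps their roles whenever the
multiplier would exceed unity in magnitude. The sweep order and the function
name are assumptions made for illustration; the hardware performs the same
kind of step in a pipelined fashion across processor rows.

```python
def neighbor_pivot_triangularize(rows):
    """Reduce a square matrix to upper-triangular form using adjacent rows only.

    Returns the triangular matrix and the list of multipliers that were used.
    """
    a = [[float(x) for x in r] for r in rows]
    n = len(a)
    multipliers = []
    for col in range(n - 1):
        # Sweep upward so that row `col` ends up holding the pivot for `col`.
        for r in range(n - 1, col, -1):
            pivot, other = a[r - 1], a[r]
            if abs(other[col]) > abs(pivot[col]):
                pivot, other = other, pivot     # larger leading element pivots
            if pivot[col] != 0.0:
                m = other[col] / pivot[col]     # magnitude never exceeds 1
                other = [x - m * y for x, y in zip(other, pivot)]
                other[col] = 0.0                # clear rounding residue
                multipliers.append(m)
            a[r - 1], a[r] = pivot, other
    return a, multipliers


tri, mults = neighbor_pivot_triangularize([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
```

Because each comparison involves only two neighboring rows, no global search
for the largest pivot is needed, which suits a locally connected array.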

Let $P_j\langle k \rangle$ denote the k'th segment of the j'th row of
"P" and $P_j\langle k \rangle_g$ denote the g'th nonzero element of $P_j\langle k \rangle$.
The same notation applies to the unity matrix "I", and every
operation applied to $P_j$ is applied to $I_j$ as well. We describe
next the implementation of the algorithm on a
per-processor-row basis. The following symbols are used: $P_c$
denotes the current pivot row on that particular processor
row, $P_i$ the row that is currently processed, $P_r$ the row that
has its leading element reduced to zero and sent to the next
processor row, "pr" the processor row index, "ino" the
iteration number of the whole operand in the processor
array, "q x q" the size of the processor array, and "h" the
number of segments for any one column (or row) of P after
folding.


Algorithm  4.2  1  THtwyrularuation 

(1)  «no  =  l: 

(2)  If  Pi<eu>l  <  Pi  <ano  >1  then  go  to  (6); 

_ FVeinoM  . 

(3J  m  P*<ino>T' 

(4)  Ptr+Pi—mPi; 



(5)  go to (9);

(6)  m = Pc<ino>1 / Pi<ino>1;

(7)  Pr <- Pc - m Pi;

(8)  Pc <- Pi;

(9)  #pi = #pi + 1;

(10) if #pi < (p - q(ino - 1)) then go to (14);

(11) #pi = 1;

(12) ino = ino + 1;


<13)  n*sjw+j(tno— l) ;  Atu  =  A; 

(14) if ino <= h then go to (2);

(15) stop;


In this algorithm, each element of a segment of a row of "P" is
processed in a way that it lags or leads its neighbor-column
elements by one step, as in most algorithms used in systolic
arrays. The multiplier "m" is generated from the diagonal position
and transmitted to every column. Since the "I" matrix has a
different zero-nonzero topological structure from the "P" matrix,
the algorithm has two distinct phases for "P" and "I". The "I"
phase can significantly lag the progress of the "P" matrix in
time, but the operations are essentially the same. A similar
backward substitution algorithm can be found in [10].
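
As a rough illustration of the neighbor-pivoting rule described
above, the following Python sketch triangularizes a small dense
matrix on a single processor. It is not the BLOSSOM array
implementation: segment folding, the processor array, and the
parallel treatment of "I" are all omitted, and the function name
and the test matrix are assumptions made for this example only.

```python
def neighbor_pivot_triangularize(matrix):
    """Reduce a square matrix to upper-triangular form.

    Each elimination step combines two ADJACENT rows only. The row with
    the larger leading element is used as the pivot, so the multiplier
    never exceeds one in magnitude (hypothetical single-processor sketch).
    """
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    max_multiplier = 0.0
    for col in range(n - 1):
        # Sweep upward so that every row below the diagonal is cleared.
        for row in range(n - 1, col, -1):
            upper, lower = a[row - 1], a[row]
            if abs(lower[col]) > abs(upper[col]):
                # Multiplier would exceed unity: exchange the roles.
                upper, lower = lower, upper
            if upper[col] != 0.0:
                m = lower[col] / upper[col]
                max_multiplier = max(max_multiplier, abs(m))
                lower = [lo - m * up for lo, up in zip(lower, upper)]
                lower[col] = 0.0  # remove round-off in the cleared entry
            a[row - 1], a[row] = upper, lower
    return a, max_multiplier


upper_triangular, largest_m = neighbor_pivot_triangularize(
    [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
)
```

Because rows are only ever exchanged with a neighbor, each step
needs data from two adjacent rows, which is what makes the rule
attractive for a nearest-neighbor processor array.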


4.2. Other Matrix Operations

The operation set of BLOSSOM also includes: system solution and
matrix multiplication (both controlled by the ECU), submatrix
multiplication, submatrix addition, submatrix norm computation,
and submatrix-vector multiplication (all controlled by the CU).
Detailed descriptions of these operations are given in [10].

5. Concluding Remarks

BLOSSOM differs from other special purpose machines by using block
LU decomposition and by exploiting the matrix sparsity commonly
expected in engineering problems. Table 1 lists the comparison of
BLOSSOM with other hardware schemes proposed in [12, 7, 9, 8, 5],
where the complexity of O(MnA*) for an LSE solution by BLOSSOM
happens in the case of NxN BBDF matrices under mild conditions.
With the architecture designed to exploit the parallelism of the
block LU decomposition algorithm, BLOSSOM is efficient in solving
problems encountered in the VLSI era. Current work on BLOSSOM also
involves its simulation, its performance evaluation, and the
expansion of its operation set.
A MULTIPROCESSOR IMPLEMENTATION OF RELAXATION-BASED
ELECTRICAL CIRCUIT SIMULATION


J. T. Deutsch and A. R. Newton
Department of Electrical Engineering
and Computer Sciences
University of California, Berkeley, 94720.


Abstract: The electrical circuit simulation of large integrated
circuits is very expensive. New relaxation-based algorithms
promise to reduce this cost by exploiting the properties of large
networks. However, this speed improvement is not sufficient for
the cost-effective analysis of very large circuits. While array
processors have helped improve the performance of circuit
simulators, further improvement can be achieved by the use of
special-purpose multiprocessors. In this paper, the implementation
of a relaxation-based circuit simulation algorithm, called
Iterated Timing Analysis, on a multi-processor is described.
Initial results suggest that this approach has a great deal of
potential for reducing the cost of circuit simulation.


1. INTRODUCTION

The use of modern CAD tools, in particular Connectivity
Verification Systems (CVS), in the design of complex integrated
circuits has increased the probability that circuits will work on
first silicon [1]. However, the time lag between a functional
circuit and a circuit that meets its performance objectives is
increasing. Simple delay estimation techniques and cell-level
circuit simulation can be used for first-order performance
estimation in constrained design methods. Unfortunately, these
approaches do not predict circuit performance accurately for
state-of-the-art circuit designs. For this reason, circuit
simulators, originally designed to simulate circuits containing
under 100 transistors, are often used today to simulate circuits
containing many thousands of transistors.

One of the most common analyses performed by circuit simulators,
and the most expensive in terms of computer time, is nonlinear,
time-domain transient analysis. By performing this analysis,
precise electrical waveform information can be obtained if the
device models and parasitics of the circuit are characterized
accurately. Because of the need to verify the performance of
larger circuits, many users have successfully simulated circuits
containing thousands of transistors despite the cost. For example,
a 700 MOSFET circuit, analyzed for 4 µs of simulated time with an
average 2 ns time step, takes approximately 4 CPU hours on a
VAX 11/780 VMS computer with floating-point accelerator hardware
using the SPICE2 program [2].

Gate-level logic simulators (e.g. [3,4]) and switch-level
simulators [5-7] can verify circuit function and provide
first-order timing information more than three orders of magnitude
faster than a detailed circuit simulator. However, to verify
circuit performance for critical paths, memory design, and analog
circuit blocks, and to detect dc circuit problems such as noise
margin errors or incorrect logic thresholds, it is often essential
to perform accurate electrical simulation. In some companies the
simulation of circuits containing many thousands of devices is
performed routinely and at great expense. In recent years,
considerable effort has been focused on techniques for improving
the speed of time-domain electrical analysis while maintaining
acceptable waveform accuracy.

A number of approaches have been used to improve the performance
of conventional circuit simulators for the analysis of large
circuits. The time required to evaluate complex device model
equations has been reduced using table-lookup models [8, 9].
Techniques based on special-purpose microcode have been
investigated for reducing the time required to solve sparse linear
systems arising from the linearization of the circuit
equations [10]. Node tearing techniques have also been used to
exploit circuit regularity by bypassing the solution of
subcircuits whose state is not changing [11, 12].

These techniques, and others, have also been used to exploit the
vector processing capabilities of high performance computers such
as the CRAY-1 [13, 14] and FPS-164 [15]. These special-purpose
computers have additional hardware designed to exploit the
parallelism and pipelining that is available in the programs they
execute. Unfortunately, circuit simulation programs are not well
suited to these computers. In particular, the sparsity of the
circuit matrix and its irregular structure cause the data
gather-scatter time to dominate overall program execution
time [14]. That is, simply fetching the data stored in memory and
writing it back out again after it has been processed becomes the
bottleneck. In all cases, the overall speed improvement of the
simulation has been at most an order of magnitude for practical
circuits.

Recently, a new class of algorithms has been applied to the
electrical IC simulation problem. New simulators using these
methods provide guaranteed accuracy [16] (waveforms as accurate
as, or more accurate than, those of standard circuit simulators)
with up to two orders of magnitude speed improvement for large
circuits [17, 19]. These simulators have been used for the
analysis of both digital and analog MOS ICs. They use relaxation
methods for the solution of the set of ordinary differential
equations (ODEs) which describe the circuit under analysis, rather
than the direct, sparse-matrix methods on which standard circuit
simulators are based. While these new algorithms provide
substantial speed improvements on conventional computers, they can
provide much greater speedups on special-purpose hardware that is
designed to exploit the particular features of these
algorithms [22].

In this paper, the use of the Iterated Timing Analysis [19-20]
(ITA) method on a special-purpose multi-processor is presented.
The ITA method is an SOR-Newton, relaxation-based method which
uses event-driven analysis and selective trace to exploit the
temporal sparsity of the electrical network [21]. Because
event-driven selective trace techniques are employed, this
algorithm lends itself to implementation on a data-driven
computer. Initial results indicate that data-driven
multiprocessors, working with a conventional host, can provide
almost limitless performance improvement for electrical circuit
simulation. This particular class of machines is also well-suited
to other network-graph-based, event-driven algorithms, including
fault simulation, layout compaction, layout-rule checking, and the
IC tape-out process for fabrication. Many nonelectrical problems
also fit this model.

A prototype simulator has been implemented on an existing
multiprocessor and experimental results from this simulator are
presented.

Throughout the paper, a common circuit example is used. It is an
industrial NMOS digital filter circuit whose netlist and parasitic
capacitor values were obtained from an analysis of the mask
layout. The circuit contains 396 electrical nodes and 898 MOSFETs,
of which approximately 31% are depletion loads, 42% are driver
devices (either source or drain node connected to ground), and the
remaining 27% are transfer gates (all three terminals are
connected to signal nodes or clocks).

In Section 2, the ITA algorithm is reviewed. In Section 3, the
implementation of ITA on a multiprocessor is described and the
MSPLICE program is introduced. Section 3 also includes the results
of early simulations of the interconnection network for this
problem and some extrapolations to more advanced machines. The
prototype multiprocessor used to gather experimental data is
described in Section 4 and the data obtained from the experimental
implementation is presented.


2. ITERATED TIMING ANALYSIS

The ITA method is a new form of electrical analysis which can be
derived from timing simulation. This form of relaxation-based
electrical analysis has shown promising results over a wide class
of circuits, from large digital circuits to complex analog
designs [18]. The SPLICE1 program, which employs ITA for
electrical analysis, is now being used successfully at a number of
industrial sites.

The starting point for a description of ITA is the electrical
circuit equation formulation. A Nodal Analysis [23] formulation
will be used to illustrate the ITA algorithms. Under the
assumptions:

•  All resistive elements, including active devices, are
   characterized by constitutive equations where voltages are the
   controlling variables and currents are the controlled
   variables.

•  All energy storage elements are two-terminal, possibly
   nonlinear, voltage-controlled capacitors.

•  All independent voltage sources have one terminal connected to
   ground or can be transformed into independent current sources
   with the use of the Norton transformation.

the nodal network equations, where there are N equations in N
unknown node voltages, N+1 nodes in the circuit, and node N+1 is
the reference node, or ground, can be written:

    $$ C(v, u)\,\dot{v} = -f(v, u) \qquad (2.1) $$

    $$ v(0) = V $$

where $v(t) \in R^N$ is the vector of node voltages at time t,
$\dot{v}(t) \in R^N$ is the vector of time derivatives of v(t),
$u(t) \in R^r$ is the input vector at time t,
$C(\cdot) : R^N \times R^r \to R^{N \times N}$ represents the
nodal capacitance matrix, $f : R^N \times R^r \to R^N$, and:

    $$ f(v, u) = [\, f_1(v, u), \ldots, f_N(v, u) \,]^T $$

where $f_i(v(t), u(t))$ is the sum of the currents charging the
capacitors connected to node i. The differential equations are
converted to a set of nonlinear, algebraic difference equations
using a stiffly-stable integration formula to give:

    $$ g(x) = 0 \qquad (2.2) $$

where $x \in R^N$ is the vector of node voltages at time
$t_{n+1}$, and an iterative relaxation method (Gauss-Jacobi or
Gauss-Seidel on a uniprocessor) is then used to solve them.
However, unlike timing analysis where a single relaxation
iteration is used per time-point, in the ITA approach the
relaxation process is continued to convergence at a time-point.

Only one Newton-Raphson iteration is used to approximate the
solution of each nodal equation per relaxation iteration, and
event-driven, selective trace techniques may still be used to
exploit latency, as for timing simulation. Since in ITA the
nonlinear circuit equations are solved by an iterative method
until satisfactory convergence is achieved, the numerical
properties of the integration methods used to discretize the
circuit equations are retained. Thus, the stability and the
accuracy problems typical of the timing simulation algorithms are
not an issue here [16].

The following algorithm, written in "Pidgin 'C'" [24], illustrates
the principal steps involved in ITA analysis, using a Gauss-Seidel
iteration, for use on a conventional computer. At each time at
which one or more nodes are scheduled to be processed, two event
lists, E_k(t_n) and E_{k+1}(t_n), are used to separate the nodes
to be processed in successive iterations, k and k+1, of the
Gauss-Seidel-Newton process.

Gauss-Seidel-Newton Iteration

put all nodes that are connected to independent sources
in event list E_k(0);


t_n = 0;

while ( t_n < TSTOP ) {

    k = 0;

    while ( event list E_k(t_n) is not empty ) {

        foreach ( i in E_k(t_n) ) {

            v_i^{k+1} = v_i^k - g_i(v^{k,i}) / [ dg_i(v^{k,i}) / dv_i ];

            where v^{k,i} = [ v_1^{k+1}, ..., v_{i-1}^{k+1}, v_i^k, ..., v_N^k ]^T

            if ( |v_i^{k+1} - v_i^k| <= epsilon, i.e. convergence is achieved ) {

                use LTE to determine the next time, t_s, for processing node i;
                add node i to event list E_k(t_s);
            }
            else {
                add node i to event list E_{k+1}(t_n);

                add the fanout nodes of node i to event list E_{k+1}(t_n)
                if they are not already on E_k(t_n);
            }
        }

        E_k(t_n) <- E_{k+1}(t_n);  E_{k+1}(t_n) <- empty;

    }

    t_n = t_{n+1};

}

where t_n is the present time for processing and t_{n+1} is the
next time in the time queue at which an event was scheduled. In
this way, the "time-step" is handled independently for each node.
The foreach construct requires that the block be executed for each
member of the set in a specified order.

This simplified algorithm does not illustrate how such issues as
time-step reduction and local truncation-error estimation are
handled. These and other important details of the algorithm are
described elsewhere [20]. While a nodal formulation was used to
describe the approach, a Modified Nodal formulation [25] can also
be derived.
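
The relaxation loop at a single time-point can be sketched in
Python as follows. This is a minimal illustration under stated
assumptions, not the SPLICE1 code: time-step control and
truncation-error estimation are left out, the fanout nodes are
always rescheduled (a simplification of the "if they are not
already on the list" test), and the two-node system, the tolerance
`tol`, and all function names are hypothetical.

```python
def gauss_seidel_newton(g, dg, fanout, x, active, tol=1e-9, max_sweeps=1000):
    """Relax the scheduled nodes at one time-point until the event list is empty.

    g[i](x)  : residual of the nodal equation of node i
    dg[i](x) : partial derivative of g[i] with respect to x[i]
    fanout[i]: nodes whose equations depend on x[i]
    """
    current = sorted(active)              # event list for iteration k
    sweeps = 0
    while current and sweeps < max_sweeps:
        following = set()                 # event list for iteration k+1
        for i in current:
            # One Newton-Raphson step, using the newest values of x.
            new_value = x[i] - g[i](x) / dg[i](x)
            converged = abs(new_value - x[i]) <= tol
            x[i] = new_value
            if not converged:
                following.add(i)
                following.update(fanout[i])
        current = sorted(following)
        sweeps += 1
    return x, sweeps


# Hypothetical two-node nonlinear system with weak mutual coupling.
g = [
    lambda x: 3.0 * x[0] + x[0] ** 3 - x[1] - 1.0,
    lambda x: 3.0 * x[1] + x[1] ** 3 - x[0] - 2.0,
]
dg = [
    lambda x: 3.0 + 3.0 * x[0] ** 2,
    lambda x: 3.0 + 3.0 * x[1] ** 2,
]
fanout = {0: [1], 1: [0]}
solution, sweeps_used = gauss_seidel_newton(g, dg, fanout, [0.0, 0.0], [0, 1])
```

Nodes drop out of the event list as soon as their update falls
below the tolerance, which is the selective-trace behaviour the
text relies on to exploit latency.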


As mentioned earlier, the ITA method has guaranteed convergence
properties. However, for tightly-coupled portions of a large
circuit (e.g. active MOS transmission gate trees or an Operational
Amplifier in unity gain configuration) the convergence rate may be
relatively slow. For these sub-circuits, a direct approach will
usually improve the overall performance of the simulation. A
modification of the above algorithm which replaces the single
circuit node with an entire subcircuit, described by a nodal
admittance matrix, has been used successfully in relaxation-based
simulators to improve convergence rate for circuits of this
type [21][17]. Applying ITA to subcircuits rather than single
electrical nodes has important implications in the multi-processor
case. In particular, it permits a tradeoff between the amount of
time spent on any single processor at an iteration and the time
spent communicating the results of the analysis; a key requirement
for deriving maximum efficiency from the multiprocessor, as will
be seen later.

3. IMPLEMENTATION OF ITA ON A MULTIPROCESSOR
3.1 Introduction

For the purpose of this description, a multiprocessor will be
defined as a collection of $N_p$ processor-memory elements (PMEs)
which communicate with one another via an interconnection network
(ICN), as illustrated in Fig. 3.1. Each processor can reference
data stored in its local memory ($t_{loc}$ time units/reference)
or, via the ICN, it can reference data stored in the memory of a
remote PME ($t_{rem}$ time units/reference, $t_{rem} \gg t_{loc}$).
While for some ICNs $t_{rem}$ is a strong function of the relative
positions of the PMEs on the network, for our purposes it is
assumed that $t_{rem}$ is a constant, upper bound on remote
reference time. For networks with non-uniform routing distance,
exploiting locality of reference is then an optimization that may
be applied later.


Fig. 3.1 A Multiprocessor Configuration


3.2 Partitioning the Problem

There are a number of ways that an algorithm such as ITA can be
partitioned for implementation on a multiprocessor. The two basic
partitioning techniques are:

(1) Functional Partitioning: A conventional process-based
    approach; allocate distinct functions as processes to each
    processor either dynamically or statically. These functions
    can be allocated at the instruction level for a fine-grained
    approach, such as data-flow [26], or the functions can be
    coarse-grained, such as allocation of event scheduling to one
    processor, MOSFET processing to another, current summing to a
    third, and so on.


(2) Data Partitioning: In this case, each processor performs a
    similar set of functions but on different data items which are
    allocated either dynamically or statically to each processor.
    In the electrical simulation case, this might correspond to
    allocating the evaluation of different collections of
    transistors to each processor while the steps performed to
    evaluate the transistors are common to all processors.

Unfortunately, coarse-grain functional partitioning does not lend
itself to uniform growth: adding more PMEs may not lead to
improved performance unless the functional units are connected by
a flexible interconnection network, and the system is designed so
that multiple copies of critical functions can be utilized
effectively. This problem can often be solved by re-architecting
the allocation of functions to processors but that is an expensive
task and to be avoided if possible.

The approach described in this paper is based on data
partitioning. Here a single sub-circuit described using an MNA
matrix is allocated to each processor. To simplify the description
of the algorithm, it is assumed that each sub-circuit consists of
a single electrical node. It is also assumed for now that
$t_{rem}$ does not depend on network loading conditions. Consider
the circuit fragment shown in Fig. 3.2(a) and the multiprocessor
shown in Fig. 3.2(b). The nodes and their associated fanin
elements have been allocated to specific PMEs. For this example,
it is assumed the allocation is static; however, nodes could
migrate to free PMEs dynamically, as described later. Note that
there are more circuit nodes than PMEs, which is usually the case
and is required to obtain maximum efficiency. Therefore, each PME
is responsible for processing more than one electrical node; a
separate event scheduler can be implemented on each PME for
handling this situation or a process-level solution may be
achieved by implementing virtual PMEs [27].


Fig. 3.2(a) Circuit fragment


Fig. 3.2(b) Allocation of nodes
to PMEs on the ICN


There are two principal ways in which the processor activity can
be coordinated. We have categorized them as explicit methods,
where a central scheduler is implemented on a single processor and
coordinates the equation solution on the other processors, and
implicit methods, where the scheduling is distributed and
performed asynchronously. The method we have implemented at this
time is an implicit approach. A single global variable, called
GlobalRemainingNets, is used to coordinate the processors at a
given time point. It is incremented whenever a node is scheduled
at this time point and is decremented when a node has finished
being processed. When GlobalRemainingNets reaches zero, all
processors move to the next time point of the simulation. From the
point of view of a single PME, P, once it has been allocated a set
of electrical nodes, N_P, it proceeds as follows at time t_n:


foreach ( node i in N_P scheduled at t_n ) {
    /* STEP (1): */

    foreach ( fanin element at i )

        obtain its fanin node voltages, v_j^K, j != i, K = k or k+1;

    /* STEP (2): */

    foreach ( fanin element at i )

        compute its contributions to nodal equation;

    obtain v_i^{k+1} using a single Newton-Raphson step
    as described in Section 2;

    if ( convergence is achieved ) {

        schedule i at t_s;

        decrement GlobalRemainingNets;
    }
    else {
        schedule i again at t_n;

        forall ( fanout nodes of i ) {
            increment GlobalRemainingNets;
            send message to their PME to
            schedule fanout node at t_n;
        }
    }
}


Fanin elements of i are circuit elements (transistors, capacitors,
voltage sources, logic gates, etc.) which are used to determine
the new voltage at i, as illustrated in Fig. 3.3. On average,
there are $N_{fe}$ fanin elements per node. To process each fanin
element, it is necessary to obtain its controlling node, or fanin
node, voltages, $v_j$. Assume there are, on average, $N_{fn}$
fanin node voltages that must be obtained per node iteration.
Fanout nodes of i are defined as nodes with at least one fanin
element connected to node i. There are an average of $N_{fo}$
fanout nodes per node. Of course, voltage supplies, clocks, and
the ground node are not considered fanout nodes since they do not
represent independent node voltages.
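
The role of the global counter in the implicit method can be
illustrated with a small, single-threaded Python simulation. It is
only a sketch: the PMEs are modelled as queues served in
round-robin order, convergence is replaced by a fixed number of
sweeps per node, and every name in the code (including
`remaining`, which stands in for the global counter) is an
assumption introduced for this example.

```python
from collections import deque


def simulate_time_point(fanout, owner, scheduled, sweeps_to_converge, n_pmes):
    """Process one time-point until the shared counter returns to zero."""
    queues = [deque() for _ in range(n_pmes)]   # one work queue per PME
    left = dict(sweeps_to_converge)             # sweeps each node still needs
    remaining = 0                               # stand-in for the global counter
    for node in scheduled:
        queues[owner[node]].append(node)
        remaining += 1
    processed = 0
    while remaining > 0:
        for queue in queues:                    # each PME handles one node per round
            if not queue:
                continue
            node = queue.popleft()
            processed += 1
            left[node] -= 1
            if left[node] <= 0:
                remaining -= 1                  # converged: node leaves this time-point
            else:
                queue.append(node)              # scheduled again, counter unchanged
                for target in fanout[node]:
                    remaining += 1              # one increment per message sent
                    queues[owner[target]].append(target)
    all_idle = all(not queue for queue in queues)
    return processed, remaining, all_idle


# Hypothetical three-node fragment spread over two PMEs.
result = simulate_time_point(
    fanout={0: [1, 2], 1: [2], 2: []},
    owner={0: 0, 1: 1, 2: 0},
    scheduled=[0],
    sweeps_to_converge={0: 2, 1: 1, 2: 1},
    n_pmes=2,
)
```

The counter always equals the number of pending node evaluations,
so reaching zero is a sufficient signal that every PME may advance
to the next time point without a central scheduler.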

Using our simple model, the values of the controlling node
voltages $v_j$ will require a local memory reference per node if
the node resides on the same PME as node i, otherwise it will
require remote memory references. In the worst case, all fanin
element nodes will reside on remote PMEs and all memory references
will require $t_{rem}$ units of time. Assuming only one remote
memory reference can be active at any time for a particular PME,
the average time taken for Step (1), $t_1$, can be approximated
by:

    $$ t_1 \approx N_{fn}\, t_{rem} \qquad (3.1) $$


Fig. 3.3 Circuit fragment showing
fanin and fanout nodes

Step (2) does not require the processor to wait for a remote
answer and hence the time taken in Step (2) depends only on the
performance of the PME, not the ICN. The time required to solve a
single nodal equation is proportional to the number of fanin
elements, since each one must be processed for the Newton-Raphson
step. In fact, the processing of each transistor, $t_{tr}$,
dominates the PME time, with a small amount of time, $t_{misc}$,
for checking convergence, updating local memory, etc. The time
required for Step (2) can be written:

    $$ t_2 = N_{fe}\, t_{tr} + t_{misc} \qquad (3.2) $$

For MOS or Bipolar circuits, where each transistor has three
controlling terminal voltages, $N_{fn}/N_{fe} = 2$. If supply
voltages and ground are considered as special-case nodes and do
not require remote reference, analysis of a number of large MOS
circuits indicates that $0.5 \le N_{fn}/N_{fe} \le 2$. Typically
$N_{fn}/N_{fe}$ is 1.2 for NMOS circuits and is 1.5 for CMOS
circuits. For the example circuit $N_{fn}/N_{fe}$ is 1.18. The
importance of this result will become clear later in this section.

One measure of the performance of a multiprocessor is its
efficiency, $\eta(N_p)$, where:

    $$ \eta(N_p) = \frac{T(1)}{N_p\, T(N_p)} \qquad (3.3) $$

where $T(N_p)$ is the "wall clock" simulation time using $N_p$
PMEs.

Even on an ideal multiprocessor it is not possible to achieve an
efficiency of 1.0 unless there are sufficient active electrical
nodes available to keep all the PMEs busy at all times.
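
As a small worked example of the efficiency measure, the Python
sketch below computes it either from two wall-clock times or from
a reported speedup. The function names are assumptions; the
speedup of 34.18 on 128 PMEs is the last entry of the ideal-model
results given in Section 3.3.

```python
def efficiency(t_single, t_parallel, n_pmes):
    """Efficiency from wall-clock times: T(1) / (N_p * T(N_p))."""
    return t_single / (n_pmes * t_parallel)


def efficiency_from_speedup(speedup, n_pmes):
    """Efficiency from a speedup figure, since speedup = T(1) / T(N_p)."""
    return speedup / n_pmes


eta_128 = efficiency_from_speedup(34.18, 128)   # about 0.27
eta_toy = efficiency(100.0, 25.0, 8)            # hypothetical timings
```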

3.3  Ideal  Models 

Consider an Ideal Gauss-Seidel Multiprocessor where $t_{rem} = 0$
(the ICN is infinitely fast) and $t_2$ is constant for all
electrical nodes. Such a machine also schedules nodes to PMEs
using an optimal dynamic scheduling technique in zero time.
Simulation of such a machine for our example circuit provides the
following upper bounds for efficiency:
    N_p    speedup    η(N_p)

      1      1.00      1.00
      2      1.97      0.98
      3      2.92      0.97
      4      3.81      0.95
      5      4.80      0.96
      6      5.63      0.94
      7      6.51      0.93
      8      7.32      0.92
      9      8.08      0.90
     10      8.86      0.89
     11      9.56      0.87
     12     10.26      0.85
     13     10.93      0.84
     14     11.65      0.83
     15     12.28      0.82
     16     12.95      0.81
     32     20.48      0.64
     64     28.47      0.44
    128     34.18      0.27

Note that beyond 32 processors less than half the maximum
performance of the machine can actually be achieved for this
circuit.

A more realistic model for multiprocessor-based simulation would
assume a constant, nonzero delay in the ICN. With a random, static
assignment of nodes to PMEs, the efficiency of such a
multiprocessor for the example circuit is:


    N_p    speedup    η(N_p)

      1      1.00      1.00
      2      1.94      0.97
      3      2.81      0.94
      4      3.71      0.93
      5      4.53      0.91
      6      5.44      0.91
      7      6.13      0.86
      8      6.78      0.85
      9      7.45      0.83
     10      8.14      0.81
     11      8.82      0.80
     12      9.35      0.78
     13      9.84      0.76
     14     10.36      0.74
     15     10.86      0.73
     16     11.64      0.73

The new model results show a reduction in efficiency. However, the
overall efficiency is still very high.

To obtain maximum efficiency, it is necessary to match the size
and activity of the circuit under analysis to the number of
processors it uses; empirical results provide the best guidelines
for such allocation on a real multiprocessor. Note that since the
analysis is decoupled via the ITA algorithm, many independent
circuits can be simulated at the same time on the same
multiprocessor. By running a number of circuits on a large number
of processors, the overall efficiency can be kept high.

3.4 Allocation of Nodes to PMEs

Nodes can be allocated to PMEs either statically, before the
analysis, or they can be allocated dynamically as the simulation
proceeds. In either case, it is important that the time taken for
determining the allocation is small, otherwise it may dominate the
total simulation time.


In the case of static allocation, a number of strategies are
possible. These include:


(1) Random Allocation, where the nodes are allocated to processors
    in random order while maintaining approximately the same
    number of nodes on each processor.

(2) Trace Allocation, where nodes that are connected in a serial
    path (i.e. a fanout element of one node is a fanin element of
    the next) are stored on the same processor to minimize the
    number of off-processor memory references. Whenever more than
    one fanout element is present, the other fanout nodes are
    allocated to different processors.

(3) Minimum Distance Allocation where, for ICNs where $t_{rem}$
    varies depending on where the remote processor is located on
    the network, adjacent electrical nodes are placed on adjacent
    processors on the ICN if possible.

Of these strategies, (1) involves the minimum amount of setup
time.
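
A minimal sketch of strategy (1) is shown below, assuming nodes
are identified by integers and PMEs are numbered from zero.
Shuffling followed by round-robin dealing is one simple way to get
a random yet balanced assignment; it is not claimed to be the
procedure used by the authors, and the function name and seed are
hypothetical.

```python
import random


def random_allocation(nodes, n_pmes, seed=0):
    """Assign nodes to PMEs in random order with balanced counts."""
    rng = random.Random(seed)        # fixed seed keeps the example repeatable
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    # Dealing the shuffled nodes round-robin keeps the per-PME counts
    # within one of each other.
    return {node: index % n_pmes for index, node in enumerate(shuffled)}


# The example circuit has 396 electrical nodes; 16 PMEs is an arbitrary choice.
allocation = random_allocation(range(396), 16)
counts = [list(allocation.values()).count(pme) for pme in range(16)]
```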

In the optimal dynamic allocation scheme, nodes are allocated
dynamically to processors to keep the load on all processors
balanced. Note that this requires additional remote memory
references or movement of the circuit description data as the
simulation proceeds.

For the example circuit, the difference in efficiency between a
random, static allocation and optimal dynamic allocation was less
than 10% in almost all cases. For this reason, the overhead
associated with more sophisticated allocation strategies may
render them less efficient overall. The results reported in
Section 4 are obtained from a random allocation strategy.

3.5 Circuit Partitioning

As mentioned earlier, circuits can be partitioned into
sub-circuits, where the elements of each sub-circuit are strongly
connected. This partitioning serves two purposes. First, it
permits direct methods to be used for the tightly-coupled portions
of the circuit, resulting in a more efficient analysis. Since the
sub-circuits are highly connected, they can be processed using
full matrix techniques which are amenable to speed-up via
pipelining.

Second, by allocating an entire sub-circuit to each processor the
amount of computation performed locally can be adjusted to balance
PME and ICN loads.

3.6 Choice of ICN

Many different intercommunication networks have been
developed [28, 29] but only a few meet the requirements of this
application. In particular, ICNs based on the Perfect Shuffle
connection [30, 31] have the following desirable properties:

(1) Low latency, which grows as the log of the number of ports in
    the ICN.

(2) Constant switch element size. The number of ports on each
    switch element is independent of the size of the ICN.

(3) Uniform Loading. All network elements see a similar load.

(4) Easy Growth. Only wiring changes and identical switches are
    necessary to increase the number of PMEs on the
    multiprocessor.

(5) Simple routing algorithm. All routing decisions can be made on
    the basis of local information and the routing process is very
    fast.

Because of the high performance of a Shuffle-based ICN,
high electrical simulation efficiencies can be achieved. For
a large circuit containing N electrical nodes with C nodes
actively changing at any time, C ≪ N, the total time spent
solving the independent node equations on a serial computer
is approximately:

T_s(C) = C (t_{eqn} + N_m t_{mem})    (3.4)

Consider a multiprocessor using a single or multiple stage
Shuffle network with an element cycle time of t_c and a
latency of k log(N_p), where N_p is the number of PMEs connected
via the shuffle network and k is a constant. Then
t_{rem} ≈ t_c k log(N_p). For now assume N_p > C; then the total
analysis time on such a network is approximately:

T_p = (t_{eqn} + N_m t_{mem}) + N_m t_c k log(N_p)    (3.5)

Eqn. (3.5) is in fact a worst-case figure because it assumes
all communication is on the critical path of the computation
and that there is no pipelining of requests. The speed-up
factor for the parallel computation is then:

T_s / T_p = C (t_{eqn} + N_m t_{mem}) / [ (t_{eqn} + N_m t_{mem}) + N_m t_c k log(N_p) ]    (3.6)

If  k  is  from  1  to  3(31):  if  C  *  Sp;  with  12-1.5  as 

“/n 

shown  above  and  f  negligible,  then  If  Inm  »  *•»* .  the 

speed-up becomes approximately C / (1 + log C). However, if the
equation solution time is larger than the network cycle time,
then the speed-up factor will be closer to C, as is demonstrated
in Section 4.
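
The two limiting cases above can be checked numerically. The following is a minimal sketch, assuming the worst-case speed-up has the form C·A / (A + N_m t_c k log N_p) with A = t_eqn + N_m t_mem. The function and parameter names, and the choice of a base-2 logarithm, are assumptions made for illustration and are not taken from MSPLICE.

```python
import math


def serial_time(c, t_eqn, n_m, t_mem):
    """Serial time for C active nodes: C * (t_eqn + N_m * t_mem)."""
    return c * (t_eqn + n_m * t_mem)


def parallel_time(t_eqn, n_m, t_mem, t_c, k, n_p):
    """Worst-case parallel time: every network access is on the critical path.

    The base-2 logarithm is an assumption; the text only says "log".
    """
    return (t_eqn + n_m * t_mem) + n_m * t_c * k * math.log2(n_p)


def speedup(c, t_eqn, n_m, t_mem, t_c, k, n_p):
    """Ratio of serial to worst-case parallel analysis time."""
    return serial_time(c, t_eqn, n_m, t_mem) / parallel_time(
        t_eqn, n_m, t_mem, t_c, k, n_p
    )


# Communication-dominated case: negligible equation time, unit access times.
comm_bound = speedup(c=16, t_eqn=0.0, n_m=1, t_mem=1.0, t_c=1.0, k=1, n_p=16)

# Computation-dominated case: equation time much larger than the network time.
compute_bound = speedup(c=16, t_eqn=1000.0, n_m=2, t_mem=1.0, t_c=1.0, k=2, n_p=16)
```

With these illustrative numbers the communication-dominated case reduces to C / (1 + log C), while the computation-dominated case approaches C.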

3.7 The MSPLICE Program

The MSPLICE program was developed to verify the
above model. It uses the Gauss-Seidel-Newton algorithm
described earlier, but can also be run using a less constrained
variation of the algorithm known as weakly chaotic
relaxation. In this case, a PME can continue to solve a node
for iterations k+1, k+2, ..., k+F_m, where F_m is the maximum
number of iterations a node can move ahead before it
must wait for updates of the values of its fanin node voltages.
For Gauss-Seidel-Newton, F_m = 1.

The algorithm used in the MSPLICE program proceeds
as follows:

MSPLICE has been implemented on both the Digital
VAX 11-730 and the BBN Butterfly machine.


4. EXPERIMENTAL IMPLEMENTATION


4.1 The Hardware

In order to evaluate the performance of
MSPLICE, it was necessary to use a real multiprocessor. The
BBN Butterfly [32,33] multiprocessor was chosen for these
experiments as it was the only machine available whose
characteristics approximate those described above.

The BBN Butterfly is a tightly-coupled multiprocessor. An
entire system may contain from 1 to 256 processor nodes.
Each processor node is itself a multiprocessor, consisting of a
Motorola 68000 and an AMD 2901 bit-slice-based Processor
Node Controller (PNC). Together, these processors can provide
from 1/3 to 1/2 the integer performance of a DEC VAX-11/780
on certain examples. Once the high-performance floating-point
co-processor board that we are developing for the
machine has been completed, floating-point performance in
the same range is expected. At this time, however, floating-point
is implemented at the assembler level and is relatively
slow.

All memory in the system is associated with the processor
boards. Each processor board contains 256K bytes of
memory, with an optional memory expansion interface capable
of holding 4M bytes.

The computation and I/O processors are connected by a
network consisting of log_4 n stages of shuffle-exchange. This
network is also known as a Base 4 Omega Network. When the
68000 attempts to read or write a virtual address which
represents memory on a remote processor board, the local
2901 communications co-processor and the 2901 communications
co-processor on the remote board work together to
read or write the correct location. Together, the PNCs and
the switch provide transparent remote memory access
across the entire machine with uniform low latency and high
bandwidth. Local memory references are completed in 829 ns.
Single word reads or writes from any processor to any remote
memory are normally completed in under 4 µs on a 16-processor
configuration, and under 5 µs in a 256-processor
configuration. This results in a ratio of remote to local
access time which is comparable to
the ratio of a cache miss to a cache hit on a uniprocessor.
The low cost of remote memory references makes the
Butterfly attractive for our application.


Block transfers are performed at a rate of 32
Mbits/second. Because the structure of the Butterfly switch
does not correspond to that of a fully-connected graph, it
cannot provide every possible pattern of connections between
inputs and outputs in a single operation. However, because of
the high performance of the switch and the statistical frequency
of conflicting requests, this does not appear to be a
problem, and a simple back-off and retry scheme is used whenever
an access conflict occurs inside the switch.


4.2 The Programming Environment

The operating system for the Butterfly, called
Chrysalis [34], is a simple operating system written in the 'C'
programming language [35]. Access to shared software
resources in the machine is provided by object handles, which
are global identifiers and are unique throughout the machine.
Whenever a process needs access to such a resource, it uses
Chrysalis to map the object into its virtual address space.

Chrysalis is oriented around the concept of process-level
concurrency and provides segmentation-based virtual
memory management and the ability to create and destroy
processes. Process images are loaded dynamically on
demand to minimise the amount of communication with the
host system.

Dual  queues  are  also  provided  to  allow  efficient  locking  of 
resources  and  passing  of  data  between  processes  without 
explicit  locking. 


4.3 Experimental Results

The results reported in this section were obtained on a
ten-processor machine. For the test circuit, the following
speed-up and efficiency values were obtained using MSPLICE:


N_p    speedup    η(N_p)
  1     1.00       1.00
  2     1.83       0.92
  3     2.28       0.76
  4     3.40       0.85
  5     4.09       0.82
  6     4.55       0.76
  7     5.33       0.76
  8     5.98       0.75
  9     6.77       0.75
 10     7.04       0.70

Note that for nine processors the speedup and efficiency of
MSPLICE on the Butterfly was 90% of the maximum possible
for this example, as presented in Section 3.

The overall run-time performance of MSPLICE is reduced due to the
poor floating-point performance of the Butterfly. In fact, with
r,  .-1  Pmm  on  the  Butterfly  for  the  simple  MOS  model  used  in 

MSPliCE,  " 1 — 400.  However.  fr  s  for  the  same  model  on  a 

VAX-11/780 with FPA is approximately 300 µs. On the other
hand, the same quantity for a modern MOS model, which considers
short-channel and other complex effects, on the VAX is approximately
5 ms. Therefore, for such an MOS model, even with
50 MIP processors on the Butterfly nodes, the computation time
per node will exceed ten times the remote memory reference time and
high efficiency will be maintained.

A conservative estimate, with 10 MIP PMEs, N_p = 256, and
η(256) = 0.5 for a 23,000-node circuit (~70,000 MOSFETs),
results in 1.3 GIP performance. In other words, the analysis of
a 70,000-MOSFET circuit on this processor would take about
the same time as the same analysis of a 50-MOSFET circuit on
a VAX-11/780.
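
The arithmetic behind this estimate can be sketched as follows. The function name is hypothetical, and treating the VAX-11/780 as a 1 MIP machine is an assumption made only to convert the aggregate rate into an equivalent circuit size.

```python
def aggregate_mips(mips_per_pme, n_pme, efficiency):
    """Effective instruction rate of the whole machine, in MIPS."""
    return mips_per_pme * n_pme * efficiency


# Figures quoted in the text: 10 MIP PMEs, 256 PMEs, efficiency 0.5.
total_mips = aggregate_mips(10, 256, 0.5)

# Assumption: the VAX-11/780 delivers roughly 1 MIP, so the multiprocessor
# handles total_mips times as many devices in the same wall-clock time.
VAX_MIPS = 1.0
equivalent_vax_mosfets = 70_000 / (total_mips / VAX_MIPS)
```

Under these assumptions the aggregate rate is 1280 MIPS (about 1.3 GIP), and the equivalent VAX circuit is about 55 MOSFETs, consistent with the rounded figure of 50 quoted above.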

5. SUMMARY


The implementation of a new form of relaxation-based
electrical circuit simulation for a special-purpose multiprocessor
has been presented. This technique promises to
improve the speed of accurate electrical circuit simulation
dramatically while reducing the overall cost of the analysis.

An implicit algorithm for the implementation of the ITA
simulation method has been described and experimental
results from its use on the Butterfly multiprocessor have
been presented. This data confirms our analysis that the
use of a high-performance, Perfect Shuffle based intercommunication
network, in conjunction with the distributed ITA
algorithm, results in a highly efficient utilization of the
multiprocessor hardware.

While the use of these techniques for circuit simulation
has been presented, the approach described here may be
applied to other electrical network graph-based problems
with similar speed improvement characteristics.
Abstract

TimberWolf is an integrated set of placement and routing
optimization programs. The general combinatorial optimization
technique known as simulated annealing [1] is used by each program.
Programs for gate array, standard cell, and macro/custom
cell placement, as well as standard cell global routing, have been
developed. Experimental results on industrial circuits show that
area savings over existing layout programs ranging from 15 to 40%
are possible.


1. Introduction

TimberWolf is an integrated set of placement and routing
optimization programs. The general combinatorial optimization
technique known as simulated annealing [1] is used by each program.
Four basic optimization programs of the TimberWolf package
have been developed:

(1) A standard-cell placement program. This program places
standard cells into rows and/or columns in addition to allowing
user-specified macro blocks and pads. The program was interfaced
to the CIPAR standard cell placement package developed by American
Microsystems, Inc. For larger circuits (800 to 1500 cells),
TimberWolf reduced total wire length by 45 to 57% in comparison
with CIPAR alone. Furthermore, final chip areas were reduced by
at least 30% as a result of the improved placement. For a circuit
of 1000 cells, TimberWolf reduced the final chip area by 31% in
comparison to CIPAR and by 21% over another commercially available
standard cell placement program.

(2) A standard cell global router program. The global router
reduced by 10 to 15% the number of wiring tracks used by the
CIPAR router. This translated to an overall area savings of 5 to 7%.
Vecchi and Kirkpatrick [2] recently described the use of simulated
annealing for global routing.

(3) A generalized gate-array placement program allowing
user-specified macros and primary terminals. This program found
placements with a 6 to 27% reduction in total estimated wire
length for several benchmark problems in comparison to the best
published results. This program optionally includes in the cost
calculation a measure of the local routing congestion.

(4) A macro/custom cell placement program. This program
places cells of any rectilinear shape. Furthermore, the cells may
have fixed geometry including pin locations (macro cells) or they
may have fixed area with a given aspect ratio range and with pins
that need to be placed (custom cells). All rotations and
reflections of each cell are considered. TimberWolf also has the
ability to place cells among user-defined sub-regions of the chip.
TimberWolf allows multiple chips to be placed simultaneously. This
package can also be used to place circuits on one or more printed
circuit boards.


2. The Basic Algorithm

Simulated annealing [1] is a statistical method for solving
general combinatorial optimization problems formulated as follows.
Given a problem specified by a state x which is an element of
a discrete set X and a cost function c(x), find x* such that c(x*)
is minimal.

2.1 Algorithm Structure

The following function gives the general structure of the
algorithm.

    structuredAlgo( x, c ) {
        /* state x and cost function c */
        while ( "stopping criterion" is not satisfied ) {
            generate a new state x' ;
            evaluate c( x' ) ;    /* new cost */
            if ( accept( c(x'), c(x) ) ) {
                /*
                   accept returns 1 if an
                   acceptance criterion has been
                   satisfied and 0 otherwise
                */
                x = x' ;
            }
        }
    }

Note that the important part of the algorithm is the function
accept. Simulated annealing uses a statistical method to decide
whether to accept or reject the new state. A parameter T and a
random number generator are critical to the strategy.

    accept( c(x'), c(x) ) {
        /*
           given the cost of a new state x' and of
           the previous state x, return 1 if the cost
           variation passes a test. T is a parameter
        */
        Δc = c(x') - c(x) ;
        if ( Δc <= 0 ) {
            return( 1 ) ;
        } else {
            y = exp( -Δc / T ) ;
            r = random( 0, 1 ) ;
            /*
               random is a function which returns a pseudo
               random number between 0 and 1 (with
               uniform distribution)
            */
            if ( r < y ) {
                return( 1 ) ;
            } else {
                return( 0 ) ;
            }
        }
    }


In the function accept, note that new states characterized by
Δc ≤ 0 always satisfy the acceptance criterion. However, for the
new states characterized by Δc > 0, the parameter T plays a fundamental
role. If T is very large, then r is likely to be less than y and
a new state is almost always accepted irrespective of Δc. If T is
small, close to 0, then only new states that are characterized by
very small Δc > 0 have any chance of being accepted. In general,
all states with Δc > 0 have smaller chances of satisfying the test
for smaller values of T.
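
The effect of T on the acceptance test can be observed directly. The following is a minimal runnable sketch of the acceptance rule described above; the function signature and the optional generator argument are assumptions for illustration, not TimberWolf code.

```python
import math
import random


def accept(new_cost, old_cost, temperature, rng=random):
    """Return True if the new state passes the acceptance test.

    Downhill or equal-cost moves are always accepted. Uphill moves are
    accepted with probability exp(-delta / T). T must be positive.
    """
    delta = new_cost - old_cost
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


# Count how often an uphill move with delta = 1 is accepted at two temperatures.
rng = random.Random(0)
accepted_hot = sum(accept(1.0, 0.0, 100.0, rng) for _ in range(1000))
accepted_cold = sum(accept(1.0, 0.0, 0.01, rng) for _ in range(1000))
```

At T = 100 nearly every uphill move is accepted, while at T = 0.01 the acceptance probability exp(-100) is effectively zero.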

Simulated annealing as proposed by Kirkpatrick, Gelatt, and
Vecchi [1] incorporates an automatic mechanism to adjust the
value of T during the optimization process. In addition, new states
are generated by a random process. The general form of the simulated
annealing algorithm follows.

    simulatedAnneal( x, T ) {
        /* Given an initial state x and initial parameter T */
        while ( "stopping criterion" is not satisfied ) {
            generate T' < T ;
            T = T' ;
            while ( "inner loop criterion" is not satisfied ) {
                generate a new state x' ;
                evaluate the new cost c( x' ) ;
                if ( accept( c(x'), c(x) ) ) {
                    x = x' ;
                }
            }
        }
    }

Simulated annealing was developed as an analogy to physical
annealing. The best results with simulated annealing are obtained
by starting with a large value of the parameter T, whereby virtually
all proposed new states are accepted. Further, the best results
are obtained when the system is allowed to achieve equilibrium at
each stage of the annealing process (that is, for each value of T).
This is implemented by the "inner loop criterion" in the simulated
annealing algorithm. The "stopping criterion" is satisfied when the
cost function's value remains the same after several stages of the
annealing process.

In simulated annealing, the best results are obtained when
the parameter T is slowly reduced when the cost function's value
begins to decrease significantly. For each successive step of the
annealing process, T is lowered exponentially. That is, T' = αT,
with 0 < α < 1. The parameter α can be a constant or can also be a
function of T. The TimberWolf programs currently allow the value
of α to be specified for each value of T. The value of α is usually in
the range of 0.8 to 0.95.
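
The overall loop with exponential cooling can be sketched on a toy problem. This is a minimal illustration and not the TimberWolf implementation: the function name, the fixed number of moves per stage, the minimum-temperature stopping rule, and the tracking of the best state seen are all assumptions. T is lowered after each stage here, a small variation on the pseudocode above.

```python
import math
import random


def simulated_anneal(state, cost, neighbor, t_start, alpha,
                     moves_per_stage, t_min, rng):
    """Anneal from `state`, lowering T by T' = alpha * T after each stage."""
    current = cost(state)
    best_state, best_cost = state, current
    t = t_start
    while t > t_min:                        # stand-in stopping criterion
        for _ in range(moves_per_stage):    # stand-in inner loop criterion
            candidate = neighbor(state, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                state, current = candidate, candidate_cost
                if current < best_cost:
                    best_state, best_cost = state, current
        t *= alpha                          # exponential cooling, 0 < alpha < 1
    return best_state, best_cost


def toy_cost(x):
    """Toy cost with a single minimum at x = 7."""
    return (x - 7) ** 2


def toy_step(x, rng):
    """Move one position left or right, staying inside 0..20."""
    return min(20, max(0, x + rng.choice((-1, 1))))


result = simulated_anneal(20, toy_cost, toy_step, t_start=100.0, alpha=0.9,
                          moves_per_stage=50, t_min=0.01,
                          rng=random.Random(1))
```

Here alpha = 0.9 lies within the 0.8 to 0.95 range quoted above; the remaining parameter values are arbitrary.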

2.2 The TimberWolf Implementation of the Simulated Annealing
Algorithm

2.2.1 Generating New States

The TimberWolf programs begin with a random initial placement
or wiring configuration. A new state is generated by either
exchanging two fundamental units or moving a unit to another
location. For the gate array placement program, the new state is
generated by the interchange of two modules, where a module
refers to a fundamental unit specified in the net list. The standard
cell placement program also generates new states by the interchange
of cells. However, because standard cells typically vary in
width, the interchange of two cells often results in a non-feasible
solution because overlaps are not allowed. This is solved by a
penalty function approach, first described by Kirkpatrick, Gelatt,
and Vecchi [1]. The TimberWolf implementation of this approach
will be described in the next section. The penalty function
approach is also employed by the macro/custom cell placement
program because the cells typically vary in both height and width.

For the standard cell and macro/custom cell problems, new
states are also generated by the movement of a cell to a new location.
Experimental investigation has revealed that the use of both
methods of generating new states is necessary to achieve the best
results. Furthermore, orientation changes of standard and
macro/custom cells are performed which result in new states.
New states are also generated for custom cells by assigning a new
location to a pin or group of pins.

For the standard cell global router program, new states are
generated by assigning a portion of a net to a different channel.

2.2.2 Cost Function

The cost function for the placement programs is based on
total estimated wire length. The standard cell and macro/custom
cell programs also include a penalty function term. The cost function
for the standard cell global router is based on the estimated
wiring area, which is approximated by the sum over all channels of
the channel density.

2.2.3 Generating New Values of T

In the current implementation of TimberWolf, the parameter
α is user-specified as α versus T data. The best results have been
obtained when α is the largest (approximately 0.95) during the
stages of the algorithm when the cost function is decreasing
rapidly. Furthermore, the value of α is given its lowest values at
the initial and latter stages of the algorithm (usually 0.80). The
value of α is gradually increased from its lowest value to its highest
value, and then gradually decreased back to its lowest value.
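
One way to represent such α versus T data is a small lookup table. The sketch below is only an illustration: the table layout, the function name, and the temperature thresholds are hypothetical. Only the 0.80 and 0.95 endpoints come from the text.

```python
def alpha_for(temperature, schedule):
    """Return alpha for the given T.

    `schedule` is a list of (lower_bound_T, alpha) pairs sorted by
    descending lower bound; the first pair whose bound is <= T applies.
    """
    for lower_bound, alpha in schedule:
        if temperature >= lower_bound:
            return alpha
    return schedule[-1][1]


# Hypothetical thresholds: alpha starts low, peaks in the middle, ends low.
SCHEDULE = [
    (1000.0, 0.80),
    (100.0, 0.90),
    (10.0, 0.95),
    (1.0, 0.90),
    (0.0, 0.80),
]

t = 5000.0
t_next = alpha_for(t, SCHEDULE) * t   # one cooling step, T' = alpha(T) * T
```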

2.2.4 The Inner Loop Criterion

The inner loop criterion is implemented by the specification
of the number of new states generated for each stage of the
annealing process. This number is specified as a multiple of the
number of fundamental units for the placement or routing problem.
For the gate array placement and standard cell global router
programs, 20 new states per unit are generated at each stage. The
standard cell and macro/custom cell placement problems have
many more degrees of freedom (orientation changes, pin location
changes, etc.) and hence 100 or more new states are generated
per cell at each stage.

2.2.5 The Stopping Criterion

The stopping criterion is implemented by recording the cost
function's value at the end of each stage of the annealing process.
The stopping criterion is satisfied when the cost function's value
has not changed for 4 consecutive stages.
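
This test is simple to state in code. The sketch below assumes that "not changed for 4 consecutive stages" means the last four recorded end-of-stage costs are identical; the function name and that reading are assumptions.

```python
def should_stop(stage_costs, unchanged_stages=4):
    """Return True when the last `unchanged_stages` recorded costs are equal.

    `stage_costs` holds the cost function's value at the end of each stage,
    oldest first.
    """
    if len(stage_costs) < unchanged_stages:
        return False
    tail = stage_costs[-unchanged_stages:]
    return all(value == tail[0] for value in tail)
```

For example, `should_stop([9, 7, 5, 5, 5, 5])` holds, while `should_stop([9, 7, 5, 5, 5])` does not.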


3. Standard Cell Placement Optimization Program

3.1 Introduction

This program optimizes the placement of standard cells into
row and/or column blocks. Furthermore, the various blocks may
have differing heights. The program also optimizes the placement
of pads or buffer circuitry, as well as macro blocks. The macro
blocks may be positioned anywhere on the chip. The estimation of
the wire length for a single net is determined by computing the
half-perimeter of the bounding box of the net. The bounding box is
defined by the smallest rectangle which encloses all of the pins
comprising the net. For the case of a two-pin net, this is the
Manhattan distance. Because exact pin locations are used in the
wire length calculations, TimberWolf considers all possible orientations
for a cell, pad, or macro block. Pins which are internally connected
within a cell are treated as a single pin with a location
which is the average of the locations of its constituent pins.
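
The wire length estimate described above can be sketched as follows. The function names and the representation of pins as (x, y) tuples are assumptions for illustration.

```python
def half_perimeter(pins):
    """Half-perimeter of the bounding box of a net's pins.

    `pins` is a non-empty list of (x, y) pin locations.
    """
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def merged_pin(pins):
    """Treat internally connected pins as one pin at their average location."""
    n = len(pins)
    return (sum(x for x, _ in pins) / n, sum(y for _, y in pins) / n)


def total_wire_length(nets):
    """Sum of the half-perimeter estimates over all nets."""
    return sum(half_perimeter(pins) for pins in nets)
```

For a two-pin net such as `[(0, 0), (3, 4)]` the estimate is 7, which is the Manhattan distance between the pins.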

The program employs the exchange class mechanism for
blocks as well as cells, pads and macros. If two blocks have the
same exchange class, then cells from these blocks are interchangeable.
Blocks with differing exchange classes may not have their
cells interchanged. Differing exchange classes for blocks are usually
employed when blocks have different heights. Furthermore,
two cells or two pads may be interchanged only if they belong to
the same exchange class.

3.2 Algorithm Details

The cost function for the simulated annealing algorithm consists
of two independent portions. The first portion is the total
estimated wire length. The second portion is the penalty function,
which consists of a total sum of overlap penalties. This penalty
function was incorporated because of the usual difference in width
of the standard cells. Often two cells are selected for interchange
which differ in width. Therefore, an exchange of location of these
two cells often results in some overlap with one or more of the
other cells. Furthermore, the program often selects a single cell
for a displacement to a new location. Once again, some overlap
may result. The exchange of cells or the displacement of a single
cell may also result in a portion of a cell dangling off the end of a
row or column block. This is treated as a case of overlap with an
imaginary cell being located at the ends of each column and row
block. This feature increases the number of states in the state
space X. Experimental investigation has shown that this results in
better placements.

When two standard cells overlap, a penalty is assessed which
is proportional to the square of the quantity formed by the amount
of overlap plus an offset parameter. The offset parameter is chosen
to ensure that when the parameter T approaches zero, the total
amount of overlap approaches zero.
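
A minimal sketch of this penalty for two cells in the same row follows. The function names, the proportionality constant `weight`, and the choice to assess no penalty when the cells do not overlap are assumptions for illustration.

```python
def overlap_1d(a_left, a_right, b_left, b_right):
    """Length of the overlap between two horizontal cell extents."""
    return max(0, min(a_right, b_right) - max(a_left, b_left))


def overlap_penalty(overlap, offset, weight=1.0):
    """Penalty proportional to (overlap + offset) squared.

    Assumption: non-overlapping cells incur no penalty, so the offset only
    applies once some overlap exists.
    """
    if overlap <= 0:
        return 0.0
    return weight * (overlap + offset) ** 2
```

With an offset of 2, an overlap of 4 units costs 36 while even a 1-unit overlap costs 9. The offset therefore keeps small overlaps expensive, which is what drives the total overlap toward zero at low T.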

The alternative to the aforementioned overlap concept is of
course to not allow overlaps. For example, when inserting a cell
into a row block, if insufficient space is available then the cells to
the right are all shifted farther to the right as necessary. This has
the obvious disadvantage of destroying the relationships between
the shifted cells and the cells on the neighboring rows. The overlap
concept was employed so as to not disturb the placement of the
remaining cells when performing an interchange of cells or a displacement
of a single cell.

The selection of new states is based on the following considerations.
(1) A random number between one and the total
number of cells, pads and macro blocks is generated. The cells are
numbered from one to the number of cells, and the pads and
macro blocks are numbered starting from the number of cells plus
one. If the random number is less than or equal to the number of
cells, then a cell is selected. Otherwise, a pad or macro block is
selected. (2) A second random number is selected between 1 and
the total number of cells, pads, and macro blocks. (3) If the two
numbers selected both represent cells, then the pair of cells are
interchanged to generate a new state. (4) Similarly, if two pads or
two macro blocks were selected, then an interchange constitutes
the new state. (5) If the two numbers selected do not represent
the same unit (that is, cell, pad, or macro block) then the first
unit selected governs the generation of a new state. If this first
unit was a cell, then this cell is displaced to a new location. If this
new state is rejected, then the next state generated is an orientation
change for the cell. If the first unit was a pad or macro block,
then an orientation change of the respective unit is attempted.
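
The decision rules in considerations (3) to (5) can be summarized in a small function. This is a sketch only: the string labels and the function name are hypothetical, and the fallback orientation change after a rejected displacement is left to the caller.

```python
def choose_move(first_kind, second_kind):
    """Pick the move type from the kinds of the two selected units.

    Each kind is one of "cell", "pad", or "macro".
    """
    if first_kind == second_kind:
        return "interchange"      # two cells, two pads, or two macro blocks
    if first_kind == "cell":
        return "displace"         # the first unit governs: move the cell
    return "reorient"             # pad or macro block: try a new orientation


def kind_of(index, num_cells, pad_or_macro_kinds):
    """Map a 1-based unit number to its kind.

    Cells are numbered 1..num_cells; `pad_or_macro_kinds` lists the kinds of
    the units numbered from num_cells + 1 upward.
    """
    if index <= num_cells:
        return "cell"
    return pad_or_macro_kinds[index - num_cells - 1]
```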

The ratio of single cell displacements to cell interchanges has
a pronounced effect on the quality of the final placement. Experimental
investigation has revealed that a ratio of about 5 to 1 yields
the best results. Hence, if the first unit selected was a cell, the
generation of the second random number is weighted to produce
the desired ratio. This is implemented by generating a random
number between one and the number of cells multiplied by 5.

In the latter stages of the algorithm, that is, when the value
of T approaches zero, the displacement of a cell has very little
chance of being accepted unless the displacement is very local.
Similarly, an exchange of distant cells has a vanishingly small
chance of being accepted. Hence, it is more efficient to employ a
range limiter, which limits the range of the displacement of a cell
or cells. Consequently, during the latter stages of the algorithm,
the cells undergo many small displacements while gradually eliminating
overlaps and reducing wire length.

Of major concern to all implementations of the simulated
annealing algorithm is CPU time. The TimberWolf standard cell program
was designed to reduce computation time while sacrificing
storage. One of the features of the program is that computation
time per iteration is constant (that is, it is invariant with the
number of cells). The iteration time is defined to be the time
required to generate a new configuration, evaluate the new value of
the cost function, and then decide to accept or reject the new
configuration. Two key features make this possible. (1) The cells in
a block are hashed into bins that partition the block's coordinate
system. Hence overlap calculations require a constant amount of
time. (2) The possible orientations for a cell, including the pin
locations for each orientation, are computed at the outset and are
stored. Thus, to change a cell orientation, only a pointer change is
required rather than recomputing the cell boundaries and pin
locations.
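
The binning idea in feature (1) can be sketched with a small class. This is an illustration under stated assumptions, not the TimberWolf data structure: the class name, the fixed bin width, integer coordinates, and half-open cell extents [left, right) are all choices made here.

```python
from collections import defaultdict


class RowBins:
    """Hash the cells of one row into fixed-width bins by x coordinate."""

    def __init__(self, bin_width):
        self.bin_width = bin_width
        self.bins = defaultdict(set)   # bin index -> names of cells touching it
        self.spans = {}                # cell name -> (left, right)

    def _bin_indices(self, left, right):
        # Bins touched by the half-open integer extent [left, right).
        return range(left // self.bin_width, (right - 1) // self.bin_width + 1)

    def insert(self, cell, left, right):
        self.spans[cell] = (left, right)
        for b in self._bin_indices(left, right):
            self.bins[b].add(cell)

    def remove(self, cell):
        left, right = self.spans.pop(cell)
        for b in self._bin_indices(left, right):
            self.bins[b].discard(cell)

    def overlapping(self, left, right):
        """Cells whose extent overlaps [left, right); only nearby bins are read."""
        found = set()
        for b in self._bin_indices(left, right):
            for cell in self.bins[b]:
                c_left, c_right = self.spans[cell]
                if c_left < right and left < c_right:
                    found.add(cell)
        return found


row = RowBins(bin_width=10)
row.insert("a", 0, 8)
row.insert("b", 6, 15)
row.insert("c", 40, 50)
near = row.overlapping(5, 12)
```

The work done by `overlapping` depends on the number of bins spanned and their occupancy, not on the total number of cells in the row, which is the property the text relies on.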

Many current standard cell optimization programs attempt to
first perform an inter-row optimization and then an intra-row
optimization. That is, each cell is first assigned to a row and then, in
a second step, the cells are placed within their respective row. The
method employed by TimberWolf simultaneously considers both
optimizations, thus yielding much reduced routing area.

3.3 Results

The program was interfaced to the CIPAR standard cell placement
package developed by American Microsystems, Inc. For the
larger circuits (800 to 1500 cells), TimberWolf reduced total wire
length by 45 to 57% in comparison with CIPAR. Furthermore,
final chip areas were reduced by at least 30%. For a circuit of 1000
cells, TimberWolf reduced the final chip area by 31% in comparison
to CIPAR and by 21% over another standard cell place and route
package marketed by a workstation company.

The computation time was 4 milliseconds per iteration (VAX
11/780 running VMS). The memory requirement is linearly related
to the number of cells. For the largest circuit (1500 cells), 20
million iterations were performed for some of the better placements.
This implies nearly 24 hours of CPU time. The memory
requirement for this circuit was 2 megabytes (32-bit integers are
used). The results are summarized in Table 1.

The ITT and QUALIC circuits could not have their areas
reduced more than 15% due to pad limitation. There are two versions
of the TELEBIT circuit. The second version has very many of
its cells specified to occur in fixed sequences. Hence the number
of states x in the state space X is significantly reduced. It has
been experimentally observed that a reduction of the cardinality of
the state space results in a lower wiring area reduction.

Table 1. Summary of Results

TimberWolf Standard Cell Placement Optimization Program
Circuits Provided Through the Courtesy of
American Microsystems, Inc.

The effect of the TimberWolf placement optimization can be
further demonstrated by the number of route-through cells which
were required. A route-through cell has two internally connected
pins, one on the top and one on the bottom. If a portion of a net
must connect two cells which are not on the same row and are not
on neighboring rows, then this net must be routed through the
rows between those containing the cells. A route-through cell
must be inserted to accomplish this for the case of two levels of
interconnect.

For the QUALIC circuit, the number of route-through cells was
reduced from 60 to 14. Furthermore, the number of route-throughs
was reduced from 51 to zero for the ITT circuit. For the
XEROX circuit, more than 1000 route-through cells were eliminated.
All of the approximately 300 route-through cells were eliminated
for the 1500-cell TELEBIT circuit.


4. Standard Cell Global Router Program
4.1 Introduction

The layout of a standard cell circuit often consists of rows of
cells bordered by pads and/or buffer circuitry. In order to
minimize the need for route-through cells (which increase the area
of a circuit), the cells are typically designed with electrically
equivalent (internally connected) pins on both the top and bottom
side. Thus a net from above can be connected to the top pin while
the same net from below can be connected to the bottom pin. The
internally connected pins are referred to as a pin cluster. A
portion of a net which must connect two pin clusters is referred
to as a net segment.

It often arises that a pin cluster from one cell must be
connected to a pin cluster from another cell on the same row. If
each such cluster has a top pin and a bottom pin, then this net
segment is defined as being switchable. A decision must be made as
to whether to route the switchable net segment in the channel above
or below the row. The TimberWolf global router assigns switchable
net segments to channels based on the minimization of the total
channel density. The total channel density is defined to be the
sum of the channel densities for all of the channels.
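
The definition of total channel density can be illustrated with a
short sketch. It assumes each net segment is reduced to a
horizontal span (left, right) and that the density of a channel is
the largest number of spans overlapping at any one position. The
function names and the rule that spans touching end-to-end do not
overlap are assumptions made for illustration, not details taken
from TimberWolf.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (left, right) horizontal extent of one net segment


def channel_density(spans: List[Span]) -> int:
    """Largest number of spans that overlap at any horizontal position."""
    events = []
    for left, right in spans:
        events.append((left, 1))    # a span opens
        events.append((right, -1))  # a span closes
    # Closings sort before openings at the same coordinate, so spans that
    # only touch end-to-end are not counted as overlapping (an assumption).
    events.sort()
    density = current = 0
    for _, step in events:
        current += step
        density = max(density, current)
    return density


def total_channel_density(channels: Dict[int, List[Span]]) -> int:
    """Sum of the individual channel densities over all channels."""
    return sum(channel_density(spans) for spans in channels.values())


# Hypothetical example: two channels keyed by an arbitrary channel index.
channels = {0: [(0, 5), (3, 8), (4, 6)], 1: [(0, 2), (2, 4)]}
total = total_channel_density(channels)
```

In this example channel 0 needs three tracks and channel 1 needs one,
so the quantity the global router tries to minimize would be 4.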

The  TimberWolf  global  router  routes  all  nets  and  considers  all 
pins  except  those  nets  and  pins  which  route  power  and  ground.  It 
is  often  the  case  (as  with  CIPAR)  that  separate  routines  are  used 
to route power and ground. The global router takes into
consideration pins on the outer pads or buffer cells.

Some standard cell place and route systems (for example,
CIPAR) do not employ a global router. Instead, only a channel
router is used and it routes as many connections as possible for
each channel. Thus the order in which the channels are routed can
have a substantial effect on the total number of wiring tracks
required (and thus the area of the circuit). In contrast, after
using the TimberWolf global router, specific pins have been
identified for interconnection. Thus the number of wiring tracks
required is independent of the order in which the channels are
routed.


The minimum spanning tree is generated for the graph via
Kruskal's algorithm [3]. This portion of the algorithm effectively
generates a Steiner tree [4] for the interconnection of the
clusters. When the minimum spanning tree has been generated, pairs
of pin clusters have been identified which are to be connected by a
net segment.
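
As an illustration of the spanning-tree step, the following is a
minimal sketch of Kruskal's algorithm with a union-find structure.
The edge list, the node numbering, and the use of Manhattan
distances as integer weights are hypothetical; how TimberWolf
stores its cluster graph is not stated here.

```python
from typing import List, Tuple

Edge = Tuple[int, int, int]  # (weight, cluster_u, cluster_v)


def kruskal_mst(num_nodes: int, edges: List[Edge]) -> List[Edge]:
    """Return the edges of a minimum spanning tree (or forest)."""
    parent = list(range(num_nodes))

    def find(x: int) -> int:
        # Follow parent links to the component root, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):   # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:             # edge joins two components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical net with four pin clusters; weights stand in for
# Manhattan distances between clusters that may legally be connected.
edges = [(4, 0, 1), (2, 0, 2), (4, 1, 2), (3, 1, 3), (5, 2, 3)]
net_segments = kruskal_mst(4, edges)
```

Each edge kept in `net_segments` corresponds to one pair of pin
clusters that would be joined by a net segment.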

Step 2

In this step, each edge of the minimum spanning tree is
examined, and one pin from each cluster is selected to form the
actual net segment. In the case of an edge connecting two clusters
on the same row, it is determined if this is a switchable net
segment. If the segment is switchable, then two pairs of pins are
selected. One pair is for the segment routed in the channel above
the row and another pair is for the segment routed in the channel
below the row.

Pin selection proceeds as follows. (1) For the case of two
clusters on neighboring rows, the bottom pin of the top cluster and
the top pin of the bottom cluster are selected based on the
minimization of the Manhattan distance between the two points. (2)
For the case of two clusters on the same row: (a) If the edge is
determined to be switchable, the top pin from each cluster is
selected based on the minimization of the distance between the two
points. Also, the bottom pin from each cluster is similarly
selected. (b) If the edge is not switchable, either the pair of top
pins (if the segment must be routed in the channel above the row)
or the pair of bottom pins (if the segment must be routed in the
channel below the row) are selected. The pin selection is again
based on the minimization of the segment length.
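
The pin selection rules above can be sketched as follows. The
representation of a cluster as a dictionary with "top" and
"bottom" pin lists, and the function names, are assumptions made
for illustration only.

```python
from itertools import product
from typing import Dict, List, Tuple

Pin = Tuple[int, int]            # (x, y) pin coordinates
Cluster = Dict[str, List[Pin]]   # {"top": [...], "bottom": [...]}


def manhattan(a: Pin, b: Pin) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def closest_pair(pins_a: List[Pin], pins_b: List[Pin]) -> Tuple[Pin, Pin]:
    """Pick one pin from each list so the Manhattan distance is smallest."""
    return min(product(pins_a, pins_b), key=lambda pq: manhattan(*pq))


def select_pins(a: Cluster, b: Cluster, same_row: bool) -> Dict[str, Tuple[Pin, Pin]]:
    if not same_row:
        # Case (1): `a` is assumed to be the cluster on the upper row.
        return {"between": closest_pair(a["bottom"], b["top"])}
    pairs = {}
    if a["top"] and b["top"]:          # routable in the channel above
        pairs["above"] = closest_pair(a["top"], b["top"])
    if a["bottom"] and b["bottom"]:    # routable in the channel below
        pairs["below"] = closest_pair(a["bottom"], b["bottom"])
    return pairs                       # two entries: switchable segment


# Hypothetical clusters on the same row.
left = {"top": [(2, 10)], "bottom": [(2, 8)]}
right = {"top": [(7, 10), (5, 10)], "bottom": [(6, 8)]}
pairs = select_pins(left, right, same_row=True)
```

A result holding both an "above" and a "below" pair corresponds to
case (2a); a single entry corresponds to case (2b).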

4.2.2 Second Stage of the Global Router Algorithm


4.2 Global Router Algorithm

The TimberWolf global router performs the optimization in
two stages. The first stage examines each net separately. Two
basic steps are applied to each net. (1) The first step identifies
which pairs of pin clusters are to be connected based on the
minimization of the Manhattan interconnection distance. This
results in the identification of the net segments. (2) The second
step considers each net segment and selects a pin from each
cluster such that the Manhattan length of the segment is minimized.
Two pairs of pins are selected for each switchable net segment.

The second stage results in the assignment of a channel for
each switchable net segment. The two stages are detailed below.

4.2.1 First Stage of the Global Router Algorithm

The first stage consists of applying the two steps detailed
below to each net separately.

Step 1

For a given net, the pin clusters that need to be connected
are determined. A graph is formed in which the clusters are
represented by the nodes and connections between the nodes
(the formation of potential net segments) are represented by
edges. An edge connects two nodes if a net segment could possibly
connect the two clusters. For example, two clusters can be
connected only if one of the following two conditions is true. (1)
They lie on the same row, with no intervening cluster occupying the
same row. This is the case of a potential switchable net segment.
The net segment is switchable if each cluster has a pin on the top
and on the bottom of the row. That is, the net segment could be
routed either in the channel above the row or in the channel below
the row. (2) They lie on neighboring rows. Furthermore, there
cannot be another cluster lying between the two clusters which
occupies either of the rows occupied by the two clusters.

The result of conditions (1) and (2) above is that the
maximum degree of a node is 4. Further, this maximum degree is
achieved when a given cluster is to be connected to two clusters in
the row above (one to the left and one to the right) and to two
clusters in the row below (also one to the left and one to the
right).


This step employs a simulated annealing algorithm. The net
segments (for all of the nets) with their respective pins are
supplied as input. One half of the minimum contact-to-contact
spacing is added to each end of the horizontal span of each segment.
For each switchable segment, an arbitrary initial selection (of
above or below the row) is made. Each channel is examined
sequentially to determine its density. The densities of the
channels are summed, and this sum is the initial value of the cost
function. A new state of the configuration is generated by the
random selection of a switchable segment and then routing it on the
opposite side of the row from its current position. As a result of
the new state, the cost function either increases by 1, decreases
by 1, or remains the same. That is, the total channel density
changes by at most 1.
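
A minimal sketch of this annealing loop is given below. The segment
encoding, the channel numbering, the Metropolis acceptance rule,
the geometric cooling schedule, and every parameter value are
assumptions chosen for illustration. Spans are assumed to already
include the half-spacing allowance, all segments are treated as
switchable, and moves that leave the cost unchanged are simply
accepted rather than being resolved by a second cost function.

```python
import math
import random
from collections import defaultdict
from typing import List, Tuple

Segment = Tuple[int, int, int]  # (row, left, right): hypothetical encoding


def density(spans: List[Tuple[int, int]]) -> int:
    """Largest number of spans overlapping at one horizontal position."""
    events = sorted([(left, 1) for left, _ in spans] +
                    [(right, -1) for _, right in spans])
    best = current = 0
    for _, step in events:
        current += step
        best = max(best, current)
    return best


def total_density(segments: List[Segment], above: List[bool]) -> int:
    # Assumed numbering: channel r lies below row r, channel r + 1 above it.
    channels = defaultdict(list)
    for (row, left, right), up in zip(segments, above):
        channels[row + 1 if up else row].append((left, right))
    return sum(density(spans) for spans in channels.values())


def anneal(segments, t_start=2.0, t_end=0.01, alpha=0.9, moves_per_t=50, seed=0):
    rng = random.Random(seed)
    above = [False] * len(segments)          # arbitrary initial selection
    cost = total_density(segments, above)
    best, best_cost = above[:], cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            i = rng.randrange(len(segments))
            above[i] = not above[i]          # route on the opposite side
            new_cost = total_density(segments, above)
            delta = new_cost - cost          # -1, 0 or +1
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = above[:], cost
            else:
                above[i] = not above[i]      # rejected: undo the switch
        t *= alpha                           # assumed geometric cooling
    return best, best_cost


segments = [(1, 0, 4), (1, 2, 6), (2, 5, 9)]
assignment, final_cost = anneal(segments)
```

Recomputing every channel density after each move keeps the sketch
short; an implementation concerned with speed would presumably
update only the two channels affected by the switch.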

The case of no change in the cost is treated further. This is
the case in which the net segment switch has no effect on the total
channel density. A second cost function is introduced in this case.
This cost function is a measure of the congestion in a channel
between the two points defining the span of a net segment. The
cost function is evaluated by taking the difference between the
overall channel density and the density between the two points
defining the span. The cost function is first evaluated for the span
of the net segment in the original channel. Next, the cost function
is evaluated for the net segment span in the new channel. The
difference in cost (Δc) is determined by subtracting the second
cost function value from the first. A negative value of Δc indicates
that switching the net segment to the new channel places the
segment in a channel of less congestion.
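
The tie-breaking cost can be sketched as below. It is assumed here
that each channel is described by the spans of the other segments
already assigned to it, excluding the segment being moved; the text
does not state whether the moved segment itself is counted, so this
choice and the function names are illustrative only.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (left, right)


def density(spans: List[Span]) -> int:
    """Largest number of spans overlapping at one horizontal position."""
    events = sorted([(left, 1) for left, _ in spans] +
                    [(right, -1) for _, right in spans])
    best = current = 0
    for _, step in events:
        current += step
        best = max(best, current)
    return best


def local_density(spans: List[Span], lo: int, hi: int) -> int:
    """Density of the channel restricted to the interval [lo, hi]."""
    clipped = [(max(left, lo), min(right, hi))
               for left, right in spans if left < hi and right > lo]
    return density(clipped)


def congestion_cost(channel: List[Span], segment: Span) -> int:
    """Overall channel density minus the density under the segment span."""
    lo, hi = segment
    return density(channel) - local_density(channel, lo, hi)


def delta_c(original: List[Span], new: List[Span], segment: Span) -> int:
    """First cost (original channel) minus second cost (new channel)."""
    return congestion_cost(original, segment) - congestion_cost(new, segment)


# Hypothetical channels: the original one is at peak density under the
# segment's span, while the new one has slack there.
segment = (2, 4)
original = [(0, 5), (1, 6), (3, 9)]
new = [(6, 9), (7, 10), (0, 3)]
change = delta_c(original, new, segment)
```

Here `change` is negative, which under the rule above indicates that
the span is less congested in the new channel.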
The global router reduced the number of wiring tracks used
by the CIPAR router by 10 to 15%. Because routing typically
occupies one half of the chip area, this translated to an overall
area savings of 5 to 7%.

The global router was applied to the QUALIC circuit after a
TimberWolf placement optimization. The total number of wiring
tracks used without the global router was 72. After employing the
global router, 65 tracks were used. This represents a 10%
reduction in wiring area.

The largest circuit (1500-cell TELEBIT) was tested with and
without TimberWolf placement. After the optimized placement, the
global router reduced the overall chip area by 6.1%. With only
CIPAR placement, the area reduction was 5.5%. A total area savings
of 34% was achieved for the TELEBIT circuit when both TimberWolf
placement optimization and the global router were applied. The
results are summarized in Table 2.

Table 2. Summary of Results
TimberWolf Standard Cell Optimization Programs
Circuits Provided Through the Courtesy of
American Microsystems, Inc.

The TELEBIT circuits that were tested did not have any cells
belonging to fixed sequences. TELEBITa refers to a
TimberWolf-placed circuit and TELEBITb refers to a CIPAR-placed
circuit. The additional area reduction value for the QUALIC circuit
represents the wiring area reduction, since the overall chip area
was pad limited. The XEROX circuit was placed using TimberWolf.


5. Gate Array Placement Optimization Program
5.1  Introduction 

This section describes the generalized gate array placement
program. Each fundamental unit in a gate array will be referred to
as a cell. Hence, a 50 by 50 gate array is said to have 2500 cells.
Some gate array designs allow additional flexibility and hence
greater gate utilization by creating functionally independent units
within a cell. For example, Tektronix gate arrays widely utilize
functional units which are half-cell sized. TimberWolf allows the
functional units to be half-cell sized or quarter-cell sized. The
term module will refer to a fundamental unit specified in the net
list. A module may be the size of: (1) a full cell, (2) a half cell
or (3) a quarter cell. Additionally, macro modules may be
specified. A macro module consists of a pre-wired,
arbitrarily-shaped collection of cells.

TimberWolf has other features which provide additional
flexibility. For example, a module (or macro module) may be
designated as immovable (that is, preplaced) or as belonging to an
exchange class of modules. The modules in such a class may only
be interchanged among themselves. This feature is often desirable
when a group of modules on the edge of the gate array are to be
considered as primary terminals. Often the exact location of a
given primary terminal is not important, only that it be on a given
edge.

It is often the case that gate arrays have wider channels in
the center of the array. This is in anticipation of the greatest
wiring congestion occurring in this region. Because prewired macro
modules usually have a fixed cell-to-cell spacing, certain macros
may not be placed in the center region (or the outer regions).
TimberWolf allows the designation of cell locations as either
suitable or unsuitable for a particular set of macro modules.

5.2 Gate Array Placement Algorithm

The TimberWolf gate array placement program can be used
with either of two cost functions. The first cost function is based
on the computation of net-crossing histograms for each horizontal
and vertical channel of the placement region. The histograms are
computed by considering the bounding box of each net and adding
1 to the histogram for each channel intersecting the bounding box.
The sum of the histogram values for each horizontal and vertical
channel is equivalent to summing the half perimeters of the
bounding boxes of each net. Further, a net-crossing threshold value
is assigned to each channel. If the number of nets crossing a
channel exceeds the specified threshold value, a penalty is assessed
proportional to the square of the number of net crossings exceeding
the threshold. The threshold mechanism has the effect of evening
out the wiring congestion during the earlier stages of the
annealing. This has been shown to result in a lower value of the
total wire length. A partitioning effect may be produced by setting
the threshold of a particular channel to zero or a negative value.
In this case, nets crossing this channel will be severely penalized.
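
The first cost function can be sketched as follows. The grid
convention is an assumption: cells sit at integer (column, row)
positions, vertical channel i lies between columns i and i + 1, and
a channel counts as intersecting a bounding box when it separates
two cell positions inside the box. The `weight` parameter of the
penalty is hypothetical.

```python
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]  # (column, row) of a module on the array


def crossing_histograms(nets: List[List[Cell]], n_cols: int, n_rows: int):
    """Count, for every channel, the net bounding boxes that cross it."""
    vertical = [0] * (n_cols - 1)     # channels between adjacent columns
    horizontal = [0] * (n_rows - 1)   # channels between adjacent rows
    for pins in nets:
        cols = [c for c, _ in pins]
        rows = [r for _, r in pins]
        for channel in range(min(cols), max(cols)):
            vertical[channel] += 1
        for channel in range(min(rows), max(rows)):
            horizontal[channel] += 1
    return vertical, horizontal


def histogram_cost(vertical: Sequence[int], horizontal: Sequence[int],
                   v_thresh: Sequence[int], h_thresh: Sequence[int],
                   weight: float = 1.0) -> float:
    """Total crossings plus a squared penalty for crossings over threshold."""
    cost = float(sum(vertical) + sum(horizontal))
    for counts, thresholds in ((vertical, v_thresh), (horizontal, h_thresh)):
        for count, limit in zip(counts, thresholds):
            excess = count - limit
            if excess > 0:
                cost += weight * excess ** 2
    return cost


# Hypothetical 4-column by 3-row array with three two-pin nets.
nets = [[(0, 0), (2, 1)], [(1, 0), (3, 0)], [(1, 2), (2, 0)]]
vertical, horizontal = crossing_histograms(nets, n_cols=4, n_rows=3)
cost = histogram_cost(vertical, horizontal, [2, 2, 2], [2, 2])
```

In this example the histogram values sum to 8, which equals the sum
of the three half perimeters (3 + 2 + 3), and one vertical channel
exceeds its threshold by one crossing.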

The formulation of the cost function in terms of net-crossing
histograms and threshold values was first introduced by
Kirkpatrick, Gelatt, and Vecchi [1].

A second cost function for this program examines the local
routing congestion more closely. For this cost function, each
channel segment is assigned a threshold value. A channel segment
is a portion of a horizontal or vertical channel with a length equal
to the cell-center to cell-center spacing in that region of the array.
For example, if the bounding box of a net encompasses 2 cells in
the horizontal direction and 3 cells in the vertical direction, then a
total of 17 segments are enclosed by the bounding box. The
congestion per channel segment introduced by this net is
approximated as the half perimeter of the bounding box (5) divided
by the total number of segments enclosed (17). The factor of 5/17
is the estimated probability of occupancy for the given net in each
of the 17 segments. The given net contributes zero to all other
segments. The summation of the occupancy probabilities over all
nets for a given segment is an estimate of the number of wiring
tracks required. The cost function is then the sum of the expected
occupancy of each segment plus a penalty assessed for each segment
which has occupancy exceeding the corresponding threshold.
Specifying a threshold value for each channel segment which
reflects the actual fixed channel width increases the likelihood that
the final placement will be routable. Furthermore, the total wire
length will be minimized within the limits of these constraints.
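
The figure of 17 segments in the example can be reproduced with a
short calculation. The counting rule used here (horizontal and
vertical segments on and inside the boundary of the box) is an
assumption that happens to match the 2 by 3 example; the function
names are hypothetical.

```python
from fractions import Fraction


def segments_enclosed(width_cells: int, height_cells: int) -> int:
    """Channel segments on or inside a bounding box of the given size."""
    horizontal = width_cells * (height_cells + 1)   # segments in horizontal channels
    vertical = height_cells * (width_cells + 1)     # segments in vertical channels
    return horizontal + vertical


def occupancy_probability(width_cells: int, height_cells: int) -> Fraction:
    """Half perimeter divided by the number of enclosed segments."""
    half_perimeter = width_cells + height_cells
    return Fraction(half_perimeter, segments_enclosed(width_cells, height_cells))


count = segments_enclosed(2, 3)          # 2 * 4 + 3 * 3
probability = occupancy_probability(2, 3)
```

Summing such probabilities over all nets whose bounding boxes
enclose a given segment gives the track estimate for that segment.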

5.3  Results 

Experiments are currently being initiated on large gate array
problems. To test the program and compare it with existing
placement techniques, a set of standard benchmarks has been
considered. These benchmarks are the ILLIAC IV computer boards
reported by Stevens [5]. Note that the printed circuit board
problem as stated for these examples is a particular case of the
general gate array placement problem described in the previous
subsection.

Wire length for a net was estimated by computing one half of
the perimeter of the net's bounding box. The figure of merit is the
sum of the estimated wire lengths for each net.

Three of the ILLIAC IV computer boards were tested. (1) The
largest example required the placement of 151 modules on an 11 X
15 board. TimberWolf reduced the total wire length by 21% over
Stevens' result and by 17% over the result published by Goto and
Kuh [6]. (2) The second example required the placement of 108
modules on an 8 X 15 board. TimberWolf reduced the total wire
length by 27% over the result published by Goto and Kuh. (3) The
third example required the placement of 67 modules on a 5 X 15
board. TimberWolf reduced the total wire length by 17% over
Stevens' result and 6% over the result published by Goto and Kuh.

The value of α remained at a constant value of 0.90 for each of
the examples. The results are summarized in Table 3. CPU times
are for a VAX 11/780 running UNIX.


Table 3. Summary of Results
TimberWolf Gate Array Placement Program
(Surviving column headings: Circuit (# modules); Time in Mins.
One surviving time entry: 580.)

6. Macro/Custom Placement Optimization Program
6.1 Introduction

This program optimizes the placement of macro cells and
custom cells, as well as pads. The term macro cell will be used to
refer to a cell contained in a cell library. That is, the dimensions
of the cell are known, as are the pin locations. The term custom
cell will be used to refer to a block of circuitry known only to
occupy an estimated area and to possess a list of pins.

The program places circuits comprised solely of macro cells
as well as circuits comprised entirely of custom cells.
Furthermore, the program will place circuits consisting of a
combination of macro and custom cells. The macro cells and custom
cells may be of any rectilinear shape.

TimberWolf allows the specification of lower and upper bounds
for the aspect ratio of a custom cell. If a range of aspect ratios is
given for a custom cell, TimberWolf will choose the shape of the
cell which minimizes chip area.

Wire length calculations are based on the exact pin locations.
Thus all possible orientations are considered for each cell.

Another feature of TimberWolf is the multiple region
capability. This feature incorporates either a division of the chip
into regions or the placement of multiple chips simultaneously.
Interchanges of cells from different regions are permitted only if
the regions belong to the same exchange class. The exchange class
mechanism is extended to individual cells as well.

Pins are specified in several possible ways. (1) A pin may be
given a particular fixed location. (2) A pin may be assigned to a
particular side or sides of the cell. (3) A group of pins may be
assigned to a particular side or sides of a cell. (4) A group of pins
may be assigned to a particular sequence as well as a particular
side or sides.

6.2 Macro/Custom Cell Placement Algorithm

The number of possible locations at which an uncommitted
pin could be placed on a custom cell can often number into the
thousands. Execution time considerations (as in the standard cell
program) require that the pin locations be stored for each
orientation of the cell. Clearly the amount of storage required can
become excessively large. This potential problem is averted by
defining a specified number of pin sites approximately evenly
spaced along the periphery of a cell. Furthermore, each site is
assigned a capacity. The capacity is a function of the number of
pin locations encompassed by the site. During the annealing
stages, pins are assigned to sites. Upon completion of the
annealing algorithm, the pins for a given site are assigned to
locations within the scope of the site based on the minimization of
wire length. For accuracy considerations, the number of pin sites
that are declared for a given placement problem is usually limited
only by memory capacity.

The locations of the pins on a macro cell are taken exactly.
That is, their location is not approximated by the pin-site
mechanism. The same is true for fixed-location pins on custom cells
(if any are so specified). The capacity for a site in the vicinity
of a fixed-location pin is correspondingly reduced.

The cost function consists of two independent parts. The first
part is the total estimated wire length which is based on the sum
over all nets of the half-perimeter of a net's bounding box. The
second is the penalty function. The penalty function consists of two
parts. (1) The first part is the sum of the overlap penalties for the
cells. This penalty function was incorporated because of the usual
difference in the size and shape of the cells. Often two cells are
selected for interchange which differ in size and/or shape.
Therefore, an exchange of location of these two cells often results
in some overlap with one or more of the cells. Furthermore, the
program often selects a single cell for a displacement to a new
location or an aspect ratio change (in the case of custom cells).
Once again, some overlap may result. The penalty assessed for an
overlap of two cells is equal to the square of the quantity of the
area of overlap plus an offset value. The offset parameter is
selected to ensure that as the parameter T approaches zero, then the
total overlap approaches zero. (2) The second part is the sum of the
penalties assessed for the contents of a pin site exceeding its
capacity. When a pin is displaced from an original site to a new
site, the contents of the old site is reduced by 1 and the contents
of the new site is increased by 1. The penalty assessed for a site is
a product of the square of the amount by which the contents exceed
the capacity, times a factor inversely related to the capacity of the
site. This factor reflects the fact that exceeding the capacity by a
given amount is a more serious violation for the sites with smaller
capacities.
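
The two penalty terms can be sketched as follows. Cells are
simplified to axis-aligned rectangles here, although the program
accepts any rectilinear shape. The choice of scale / capacity as
the inversely related factor, the `scale` parameter, and the rule
that non-overlapping cells incur no penalty are assumptions made
for illustration.

```python
from typing import Tuple

Rect = Tuple[int, int, int, int]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def overlap_area(a: Rect, b: Rect) -> int:
    """Area shared by two axis-aligned rectangles (0 if they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0


def overlap_penalty(a: Rect, b: Rect, offset: float) -> float:
    """Square of (overlap area plus offset), applied only when cells overlap."""
    area = overlap_area(a, b)
    return (area + offset) ** 2 if area > 0 else 0.0


def pin_site_penalty(contents: int, capacity: int, scale: float = 1.0) -> float:
    """Squared excess over capacity, weighted more heavily for small sites."""
    excess = contents - capacity
    if excess <= 0:
        return 0.0
    return excess ** 2 * scale / capacity   # assumed inverse-capacity factor


cell_a = (0, 0, 4, 4)
cell_b = (3, 2, 6, 5)
cells_penalty = overlap_penalty(cell_a, cell_b, offset=1.0)
small_site = pin_site_penalty(contents=3, capacity=2)
large_site = pin_site_penalty(contents=5, capacity=4)
```

Both sites exceed capacity by one pin, yet the smaller site receives
the larger penalty, which is the behavior the text describes.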

New states can be generated in several possible ways. (1) A
pair of cells (either could be a macro cell or a custom cell) are
selected for interchange. (2) A single cell is selected for a
displacement to a new location. (3) A single cell is selected for an
orientation change. (4) A custom cell is selected for an aspect ratio
change. (5) An uncommitted pin (or sequence of pins) is assigned
to a new site (or sites).

The ratio of single cell displacements to cell interchanges has
a significant effect on the quality of the final placement. Initial
experimental investigation has revealed that the best results are
obtained when the ratio is about 10 to 1.

The strategy for generating new states is based on the
following. (1) A random number between one and the number of cells
is generated. The cells are numbered sequentially from one. (2) A
second random number is generated between 1 and the number of
cells times 10. (3) If the two numbers both represent cells, then
the pair of cells are interchanged to generate a new state. (4) If
only the first number represents a cell, then the new state is
generated by the displacement of the cell to a randomly selected
location. If this new state was rejected, the next state generated
is an orientation change for the cell. Similarly, if this new state
was rejected and if the cell is a custom cell, then the next state
generated is an aspect ratio change. Finally, if this new state was
rejected, then a new state is generated by the selection of an
uncommitted pin or group of uncommitted pins for transfer to a
new pin site or sites.
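
The two-random-number rule for choosing between an interchange and
a displacement can be sketched as below. The function name and the
tuple it returns are hypothetical, and treating a draw that names
the same cell twice as a displacement is an assumption, since the
text does not say how that case is handled.

```python
import random
from typing import Optional, Tuple

Move = Tuple[str, int, Optional[int]]


def propose_move(num_cells: int, rng: random.Random) -> Move:
    """Choose between a pairwise interchange and a single-cell displacement."""
    first = rng.randint(1, num_cells)          # always names a cell
    second = rng.randint(1, 10 * num_cells)    # names a cell about 1 time in 10
    if second <= num_cells and second != first:
        return ("interchange", first, second)
    # Only the first number is a usable cell: displace it. If that move is
    # rejected, the text tries an orientation change, then an aspect ratio
    # change (custom cells only), then a pin reassignment.
    return ("displace", first, None)


rng = random.Random(1)
moves = [propose_move(50, rng) for _ in range(20000)]
interchanges = sum(1 for move in moves if move[0] == "interchange")
fraction = interchanges / len(moves)
```

With the second number drawn from a range ten times the cell count,
roughly one proposal in ten is an interchange, consistent with the
ratio of about 10 to 1 reported above.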

In the latter stages of the algorithm, that is, when the value
of T approaches zero, the displacement of a cell has very little
chance of being accepted unless the displacement is very local.
Similarly, an interchange of distant cells has a vanishingly small
chance of being accepted. Thus a range limiter is employed which
limits the range of the displacement of a cell or cells.
Consequently, during the latter stages of the algorithm, the cells
undergo many small displacements while gradually reducing wire
length and forcing the penalty function to zero.

The TimberWolf macro/custom cell placement optimization
program is currently being interfaced to CIPAR for testing
purposes.

7. Conclusions

The TimberWolf placement and routing package has been
shown to provide substantial chip area savings in comparison to
existing standard cell layout programs. Substantial wire length
reductions were also achieved for the gate array placement
program for some benchmark examples. The TimberWolf
macro/custom program, being tested now, is applicable to
placement problems as complex as a multi-chip design, employing a
combination of macro cells and custom cells.

The TimberWolf package has demonstrated that the simulated
annealing optimization technique is able to capture a wide range of
user requirements. Our research group is also actively engaged in
a theoretical investigation of the simulated annealing optimization
technique.

The TimberWolf package will be interfaced to the SQUID
database developed by our research group. In addition, the package
will be interfaced with the YACR channel router, providing a
complete standard cell placement and routing package.

The TimberWolf placement and routing package is written in
the C programming language. The package currently runs under
both the VAX/UNIX and VAX/VMS operating systems. The package
is easily convertible to other systems supporting the C language.
Corner Stitching: A Data-Structuring Technique
for VLSI Layout Tools
Abstract - Corner stitching is a technique for representing rectangular
two-dimensional objects. It is especially well suited for interactive VLSI
layout editing systems. The data structure has two important features:
first, empty space is represented explicitly; and second, rectangular areas
are stitched together at their corners like a patchwork quilt. This
organization results in fast algorithms (linear or constant expected time)
for searching, creation, deletion, stretching, and compaction. The
algorithms are presented under a simplified model of VLSI circuits, and the
storage requirements of the structure are discussed. Corner stitching has
been implemented in a working layout editor. Initial measurements indicate
that it requires about three times as much memory space as the simplest
possible representation.


I.  Introduction 

INTERACTIVE LAYOUT tools for integrated circuits place
special burdens on their internal data structures. The data
structures must be able to deal with large amounts of information
(one-half million or more geometrical elements in current
layouts [7]) while providing instantaneous response to the
designer. As the complexity of design increases, tools must give
more and more powerful assistance to the designer in such areas
as routing and validation. To support these intelligent tools,
the underlying data structures must provide fast geometrical
operations, such as locating neighbors for stretching and
compaction, and locating empty space for routing. The data
structures must also permit fast incremental modification so that
they can be used in interactive systems.

This work was supported in part by the Defense Advanced Research
Projects Agency (DoD), DARPA Order 3803, monitored by the Naval
Electronic System Command under Contract N00039-81-K-0251.

The author is with the Computer Science Division, Department of
Electrical Engineering and Computer Sciences, University of California,
Berkeley, CA 94720.

Corner stitching is a data-structuring technique that meets
these needs. As described here, it is limited to designs with
Manhattan features (horizontal and vertical edges only); but
within that framework it provides a variety of powerful
operations, such as neighbor-finding, stretching, compaction, and
channel-finding. The algorithms for the operations depend only
on local information (the objects in the immediate vicinity of
the operation). Their expected running times are generally
linear in the number of nearby objects; in pathological cases
(which are unlikely for actual layouts) the running times may
be proportional to the overall design size or to the product of
nearby objects and design size. Corner stitching is especially
effective when the objects are relatively uniform in size, as is
the case for low-level mask features. However, it also works
well when there is variation in feature size. This occurs, for
example, in a hierarchical layout where one cell might contain
a few large subcells and many small wires to connect them
together.

Corner stitching permits modifications to the database to be
made quickly, since only local information is used in making
the updates. Most existing systems that provide powerful
operations such as routing and compaction do not provide
inexpensive updates: small changes to the database can result in
large amounts of recomputation. Corner stitching’s combination of
powerful operations and easy updates means that many powerful
tools previously available only in “batch” mode can now
be embedded in interactive systems.

II.  A  Simplified  Model  of  VLSI  Layouts 

A  VLSI  layout  is  normally  specified  as  a  hierarchical  collec¬ 
tion  of  cells,  where  each  cell  contains  geometrical  shapes  on 
several  mask  layers  and  pointers  to  subcells.  As  a  convenience 
in  presenting  the  data  structure  and  algorithms,  a  simplified 
model  is  used  in  this  paper.  There  is  only  a  single  mask  layer, 
and  hierarchy  is  ignored.  For  this  paper,  the  author  defines  a 
“circuit”  to  be  a  collection  of  rectangles.  There  is  a  single  de¬ 
sign  rule  in  the  model:  rectangles  may  not  overlap.  The  simpli¬ 
fied  model  makes  it  easier  to  present  the  data  structure  and  al¬ 
gorithms.  Section  VII  discusses  how  the  simple  model  can  be 
generalized  to  handle  real  VLSI  layouts. 

III.  Existing  Mechanisms 

3.1.  Linked  Lists 

The  simplest  possible  technique  for  representing  rectangles  is 
just to keep all of them in a linked list. This technique is used
in the Caesar system [6]: each cell is represented by a list of
rectangles  for  each  of  the  mask  layers.  Even  though  operations 
such  as  neighbor-finding  require  entire  lists  to  be  searched,  the 
structure  works  well  in  Caesar  for  two  reasons.  First,  large 
layouts  are  broken  down  hierarchically  into  many  small  cells; 
only  the  top-most  cells  in  the  hierarchy  ever  contain  more  than 
a few hundred rectangles or a few children [7]. Second, Caesar
provides  only  very  simple  operations  like  painting  and  erasing. 
More  complex  functions  such  as  design  rule  checking  and  com¬ 
paction  could  not  be  implemented  efficiently  using  rectangle 
lists. 

3.2.  Bins 

The  most  popular  data  structures  for  VLSI  are  based  on  bins 
[2]. In bin-based systems, an imaginary square grid divides the
area of the circuit into bins, as in Fig. 1. All of the rectangles
intersecting a particular bin are linked together, and a two-dimensional
array is used to locate the lists for different bins.
Rectangles in a given area can be located quickly by indexing
into the array and searching the (short) lists of relevant bins.
The bin size is chosen as a tradeoff between time and space:
as bins get larger, it takes longer to search the lists in each bin;
as  bins  get  smaller,  rectangles  begin  to  overlap  several  bins  and 
hence  occupy  space  on  several  lists. 
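The bin scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name BinGrid, its methods, and the use of a dictionary in place of the two-dimensional array are hypothetical choices, and rectangles are taken to be (x0, y0, x1, y1) tuples with integer, half-open extents.

```python
from collections import defaultdict


class BinGrid:
    """Rectangles are (x0, y0, x1, y1) tuples with half-open extents."""

    def __init__(self, bin_size):
        self.bin_size = bin_size
        # A dict keyed by (column, row) stands in for the two-dimensional array.
        self.bins = defaultdict(list)

    def _cells(self, x0, y0, x1, y1):
        # Yield the index of every bin that the given area touches.
        s = self.bin_size
        for col in range(x0 // s, (x1 - 1) // s + 1):
            for row in range(y0 // s, (y1 - 1) // s + 1):
                yield col, row

    def insert(self, rect):
        # A rectangle is linked into every bin it intersects.
        for cell in self._cells(*rect):
            self.bins[cell].append(rect)

    def query(self, x0, y0, x1, y1):
        """Return the set of rectangles that intersect the given area."""
        found = set()
        for cell in self._cells(x0, y0, x1, y1):
            for r in self.bins.get(cell, ()):
                if r[0] < x1 and x0 < r[2] and r[1] < y1 and y0 < r[3]:
                    found.add(r)
        return found


grid = BinGrid(bin_size=10)
for rect in [(1, 1, 3, 3), (8, 8, 25, 12), (40, 40, 42, 42)]:
    grid.insert(rect)

small = grid.query(0, 0, 5, 5)
wide = grid.query(20, 9, 30, 11)
# Number of bin lists that hold the one wide rectangle.
copies = sum((8, 8, 25, 12) in lst for lst in grid.bins.values())
```

The wide rectangle is stored once per bin it overlaps, which is the space cost of small bins mentioned above; a larger bin_size would reduce the copies but lengthen each list.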

Fig. 1. In bin-based data structures, the circuit is divided by an imaginary
grid, and all the rectangles intersecting a subarea are linked
together.

Fig. 2. Neighbor pointers can be used to indicate horizontal or vertical
adjacency. However, if tile X is moved right, it is hard to update the
vertical pointers without scanning the entire database.

Bin structures are most effective when rectangles have nearly
uniform size and spatial distributions; they suffer from space
and/or time inefficiencies when these conditions are not met.
A  pathological  case  is  a  cell  with  a  few  large  child  cells  and 
many  small  rectangles  to  interconnect  them.  If  bins  are  small, 
there  will  be  many  empty  bins  in  the  large  areas  of  the  subcells, 
resulting  in  wasted  space  for  the  bins;  if  bins  are  large,  the  bins 
in  the  wiring  area  will  have  many  rectangles,  resulting  in  slow 
searches.  Hierarchical  bin  structures  [4]  have  recently  been 
proposed  as  a  solution  to  the  problems  of  nonuniformity.  Al¬ 
though  bins  can  be  used  to  locate  all  the  objects  in  an  area, 
they  do  not  directly  embody  the  notion  of  nearness.  To  find 
the  nearest  object  to  a  given  one,  it  is  necessary  to  search  adja¬ 
cent  bins,  working  out  from  the  object  in  a  spiral  fashion.  Fur¬ 
thermore,  bin  structures  do  not  indicate  which  areas  of  the  chip 
are  empty;  empty  areas  must  be  reconstructed  by  scanning  the 
bins.  The  need  to  constantly  scan  bins  to  recreate  information 
makes  bin  structures  clumsy  at  best,  and  inefficient  at  worst, 
especially  for  operations  such  as  compaction  and  stretching. 

3.3.  Neighbor  Pointers 

A  third  class  of  data  structures  is  based  on  neighbor  pointers. 
In  this  technique,  each  rectangle  contains  pointers  to  rectangles 
that  are  adjacent  to  it  in  x  and  y  (see  Fig.  2).  Neighbor  point¬ 
ers  are  a  popular  data  structure  for  compaction  programs  such 
as Cabbage [3], since they provide information about relationships
between objects. For example, a simple graph traversal
can  be  used  as  part  of  compaction  to  determine  the  minimum 
feasible  width  of  a  cell. 

Fig. 3. An example of tiles in a corner-stitched data structure. Solid
tiles are represented with dark lines, space tiles with dotted lines. The
entire area of the circuit is covered with tiles. Space tiles are made
as wide as possible.

Neighbor  pointers  have  two  drawbacks.  First,  modifications 
to  the  structure  generally  require  all  the  pointers  to  be  recom¬ 
puted.  For  example,  if  an  object  is  moved  horizontally,  as  in 
Fig.  2,  vertical  pointers  may  be  invalidated.  There  is  no  simple 
way  to  correct  the  vertical  pointers  short  of  scanning  the  entire 
database.  The  second  problem  with  neighbor  pointers  is  that 
they  provide  no  assistance  in  locating  empty  space  for  routing, 
since  only  the  occupied  space  is  represented  explicitly.  For 
these  two  reasons,  neighbor  pointers  do  not  appear  to  be  well- 
suited  to  interactive  systems  or  those  that  provide  routing  aids. 

IV.  Corner  Stitching 

Corner  stitching  arose  from  a  consideration  of  the  weaknesses 
of  the  above  mechanisms,  and  has  two  features  that  distinguish 
it  from  them.  The  first  important  feature  is  that  all  space,  both 
empty  and  occupied,  is  represented  explicitly  in  the  database. 
The  second  feature  is  a  novel  way  of  linking  together  the  ob¬ 
jects  at  their  corners.  These  corner  stitches  permit  easy  modi¬ 
fication  of  the  database,  and  lead  to  efficient  implementations 
for  a  variety  of  operations. 

Fig.  3  shows  four  objects  represented  in  the  corner  stitching 
scheme.  The  picture  resembles  a  mosaic  with  rectangular  tiles 
of  two  types,  space  and  solid.  The  tiles  must  be  rectangles  with 
sides  parallel  to  the  axes.  Tiles  contain  their  lower  and  left 
edges,  but  not  their  upper  or  right  edges,  so  every  point  in  the 
plane  is  present  in  exactly  one  tile.  The  entire  plane  is  covered 
from -infinity to +infinity in both x and y (in practice, the
largest  representable  positive  and  negative  numbers  are  used 
for  the  infinities).  Coverage  to  infinity  is  achieved  by  extending 
the  outermost  space  tiles;  no  extra  tiles  are  required. 

The  space  tiles  are  organized  as  maximal  horizontal  strips. 
This  means  that  no  space  tile  has  other  space  tiles  immediately 
to  its  right  or  left.  When  modifying  the  database,  horizontally 
adjacent  space  tiles  must  be  split  into  shorter  tiles  and  then 
joined  into  maximal  strips,  as  shown  in  Fig.  4.  After  making 
sure that space tiles are as wide as possible, vertically adjacent
tiles  are  merged  together  if  they  have  the  same  horizontal  span. 
The  representation  of  space  is  of  no  consequence  to  the  VLSI 
layout  or  to  the  designer,  and  will  not  even  be  visible  in  real 
systems.  However,  the  maximal  horizontal  strip  representation 
is crucial to the space and time efficiency of the tools, as we
shall see in Sections V and VI. Among its other properties,
the horizontal-strip representation is unique: there is one and
only one decomposition of space for each arrangement of
solid tiles.

Fig. 4. No space tile may have another space tile to its immediate right
or left. In this example, tiles A and B in (a) must be split into the
shorter tiles of (b), then merged together into wide strips in (c), and
finally merged vertically in (d).
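The uniqueness of the maximal-horizontal-strip decomposition can be illustrated by computing the space tiles directly from a set of solid rectangles. The sketch below is not the incremental split-and-merge procedure of the paper; it is a batch computation, written under the assumptions that the plane is a finite box, that coordinates are integers, and that rectangles are (x0, y0, x1, y1) tuples. The function name space_tiles is hypothetical.

```python
def space_tiles(solids, box):
    """Return the maximal-horizontal-strip space tiles inside box.

    solids and box are (x0, y0, x1, y1) tuples; solids must not overlap
    and must lie inside box.
    """
    bx0, by0, bx1, by1 = box
    # Horizontal cut lines: every solid top or bottom edge, plus the box.
    ys = sorted({by0, by1} | {y for s in solids for y in (s[1], s[3])})
    growing = {}   # (x0, x1) -> bottom y of a space tile still growing upward
    done = []
    for lo, hi in zip(ys, ys[1:]):
        # Solids that cross this band, ordered left to right.
        spans = sorted((s[0], s[2]) for s in solids if s[1] <= lo and hi <= s[3])
        gaps, x = [], bx0
        for x0, x1 in spans:
            if x < x0:
                gaps.append((x, x0))      # widest possible gap between solids
            x = x1
        if x < bx1:
            gaps.append((x, bx1))
        # Close tiles whose horizontal span does not continue into this band.
        for span in list(growing):
            if span not in gaps:
                done.append((span[0], growing.pop(span), span[1], lo))
        # Same span as the band below: merge vertically; otherwise start anew.
        for span in gaps:
            growing.setdefault(span, lo)
    done.extend((x0, y0, x1, by1) for (x0, x1), y0 in growing.items())
    return sorted(done)


solids = [(0, 0, 4, 9), (4, 6, 7, 8), (4, 4, 7, 6), (6, 2, 9, 4)]
strips = space_tiles(solids, box=(0, 0, 10, 9))
```

Because the result depends only on the geometry of the solid tiles, presenting the same solids in a different order yields the same list of space tiles.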

Tiles  are  linked  by  a  set  of  pointers  at  their  corners,  called 
corner  stitches.  Each  tile  contains  four  stitches,  two  at  its 
lower-left corner and two at its upper-right corner, as illustrated
in Fig. 5. Since there is one pointer in each of the four directions,
the stitches provide a form of sorting that is equivalent
to neighbor pointers. Originally, eight stitches were used, two
at each of the four corners, but four turned out to be sufficient
for the algorithms presented here. The choice of these particular
four stitches is important.

Fig. 5. Each tile is connected to its neighbors by four pointers called
corner stitches. The names of the stitches indicate the tiles they
point to: the tr stitch points to the tile’s topmost right neighbor, the
lb stitch points to the tile’s leftmost bottom neighbor, and so on.

The  tile/stitch  representation  has  several  attractive  features, 
which  will  be  illustrated  in  the  sections  that  follow.  First,  the 
mechanism  combines  both  horizontal  and  vertical  pointers  in  a 
single  structure.  The  space  tiles  provide  a  form  of  registration 
between  the  horizontal  and  vertical  information  and  make  it 
easy  to  keep  all  the  pointers  up  to  date  as  the  circuit  is  modified. 
Because  the  space  tiles  may  vary  in  size  (as  opposed  to  fixed- 
size  bins),  the  structure  adapts  naturally  to  variations  in  the 
sizes  of  the  solid  tiles.  The  maximal  horizontal  strip  representa¬ 
tion  of  space  results  in  clean  upper  bounds  on  the  number  of 
space  tiles  and  also  on  the  complexity  of  the  algorithms.  All 
tiles  have  the  same  number  of  pointers  to  other  tiles,  so  they 
occupy  the  same  number  of  bytes  of  storage;  this  simplifies 
the database management and reduces the “constant factors”
in algorithms.

V.  Algorithms 

This  section  presents  algorithms  for  manipulating  the  tiles 
and  corner  stitches.  The  most  important  attribute  of  all  the 
algorithms  is  their  locality:  each  algorithm  depends  only  on 
information  in  the  immediate  vicinity  of  the  operation.  None 
of  the  algorithms  has  an  expected  running  time  any  worse  than 
linear in the number of tiles in the affected area. Pathological
cases  will  be  shown  where  the  algorithms  require  time  linear, 
or  even  quadratic,  in  the  overall  layout  size,  but  in  practice 
(particularly  for  VLSI  layouts,  which  tend  to  be  densely  packed) 
their  running  times  are  small  and  independent  of  the  size  of 
the  layout. 

In  discussing  the  performance  of  the  algorithms,  the  corner 
stitches  provide  a  good  unit  of  measure.  The  complexity  of 
the  algorithms  will  be  discussed  in  terms  of  the  number  of 
stitches  that  must  be  traversed  (or,  alternatively,  the  number 
of  tiles  that  must  be  visited)  and/or  the  number  of  stitches 
that  must  be  modified. 

5.1.  Point  Finding 

Several  different  kinds  of  searching  are  facilitated  by  corner 
stitching.  One  of  the  most  common  operations  is  to  find  the 
tile  at  a  given  (x,  y)  location.  Fig.  6  illustrates  how  this  can 
be  done  with  corner  stitching.  The  algorithm  iterates  in  x  and 
y, starting from any given tile in the database:

Fig.  6.  To  locate  the  tile  containing  a  given  point,  alternate  between 
up/down  and  left/right  motions. 

1)  First  move  up  or  down,  using  right  top  (rt)  and  left  bottom 
(lb)  stitches,  until  a  tile  is  found  whose  vertical  range  contains 
the  desired  point. 

2) Then move left or right, using tr and bl stitches, until a
tile  is  found  whose  horizontal  range  contains  the  desired  point. 

3)  Since  the  horizontal  motion  may  have  introduced  a  ver¬ 
tical  misalignment,  steps  1)  and  2)  may  have  to  be  iterated 
several  times  to  locate  the  tile  containing  the  point.  The  con¬ 
vexity  of  the  tiles  guarantees  that  the  algorithm  will  converge. 
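The three steps above can be sketched as follows. Everything in this block is illustrative: the Tile class, the brute-force stitch helper, and the nine-tile example plane are assumptions made for the example, not part of the paper. The plane is finite (stitches at its border are None), coordinates are integers, and tiles are half-open so that each point lies in exactly one tile.

```python
class Tile:
    """Half-open rectangle [x0, x1) x [y0, y1) with four corner stitches."""

    def __init__(self, name, x0, y0, x1, y1, solid=False):
        self.name, self.solid = name, solid
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.rt = self.tr = self.lb = self.bl = None


def stitch(tiles):
    """Set all stitches by brute force (for small examples only)."""
    def at(x, y):
        return next((t for t in tiles
                     if t.x0 <= x < t.x1 and t.y0 <= y < t.y1), None)
    for t in tiles:
        t.rt = at(t.x1 - 1, t.y1)   # rightmost top neighbor
        t.tr = at(t.x1, t.y1 - 1)   # topmost right neighbor
        t.lb = at(t.x0, t.y0 - 1)   # leftmost bottom neighbor
        t.bl = at(t.x0 - 1, t.y0)   # bottommost left neighbor


def example_plane():
    """A hypothetical 10 x 9 plane: four solid tiles, five space strips."""
    spec = [("X", 0, 0, 4, 9, True), ("U", 4, 8, 10, 9), ("P", 4, 6, 7, 8, True),
            ("P2", 4, 4, 7, 6, True), ("V", 7, 4, 10, 8), ("Y", 4, 2, 6, 4),
            ("Q", 6, 2, 9, 4, True), ("Z", 9, 2, 10, 4), ("B", 4, 0, 10, 2)]
    tiles = {s[0]: Tile(*s) for s in spec}
    stitch(list(tiles.values()))
    return tiles


def find_point(start, x, y):
    """Walk stitches from start to the tile containing (x, y).

    The point must lie inside the plane.
    """
    t = start
    while not (t.x0 <= x < t.x1 and t.y0 <= y < t.y1):
        # Step 1: move up or down until the vertical range contains y.
        while y >= t.y1:
            t = t.rt
        while y < t.y0:
            t = t.lb
        # Step 2: move right or left until the horizontal range contains x.
        while x >= t.x1:
            t = t.tr
        while x < t.x0:
            t = t.bl
        # Step 3: the outer loop repeats if step 2 broke the vertical alignment.
    return t


tiles = example_plane()
hit = find_point(tiles["B"], 5, 7)
```

Starting from tile B, the search for (5, 7) climbs to V, moves left to P2, finds that the vertical range no longer contains the point, and climbs once more to reach P. This is the misalignment that step 3) handles.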

In  the  worst  case,  this  algorithm  may  require  every  tile  in 
the  entire  structure  to  be  searched  (this  happens,  for  example, 
if  all  the  tiles  in  the  structure  are  in  a  single  column  or  row). 
Fortunately,  the  average  case  behavior  is  much  better  than  this. 
If  there  are  a  total  of  N  space  or  solid  tiles  and  they  are  of 
relatively uniform size, then on the order of √N tiles will be
passed  through  in  the  average  case.  For  a  layout  containing  a 
million  tiles  (which  is  typical  of  the  fully  expanded  mask  sets 
of  current  VLSI  circuits),  this  means  a  few  thousand  tiles  will 
have  to  be  touched. 

In  interactive  systems,  there  is  a  simple  way  to  reduce  the 
time  spent  in  point  finding:  keep  a  pointer  around  to  any  tile 
in  the  approximate  area  where  the  designer  is  working.  When 
a  large  design  is  being  edited,  the  designer’s  attention  is  gener¬ 
ally  focused  on  a  small  piece  of  the  design  (e.g.,  a  piece  that 
can  be  viewed  comfortably  on  a  graphic  device).  If  a  hint  tile 
in  this  area  is  remembered  for  reference,  the  search  time  de¬ 
pends  only  on  how  much  is  on  the  screen,  not  how  large  the 
design  is. 

The  point-finding  algorithm  illustrates  a  general  feature  of 
most  of  the  algorithms:  misalignment.  While  searching  hori¬ 
zontally,  it  is  possible  to  lose  the  vertical  alignment,  so  the 
algorithm  must  iterate  over  horizontal  and  vertical  motions. 
See  Fig.  6  for  an  example.  In  general,  large  tiles  can  cause  the 
algorithms  of  this  paper  to  wander  arbitrarily  far  outside  their 
areas of interest. When this happens, the algorithms must
traverse stitches to get back to the desired area again. Extreme
misalignment results in worst-case behavior for many of the
algorithms. Fortunately, severe misalignment is unlikely for
densely packed designs.

Fig. 7. The corner stitches provide a simple way to find all the tiles
that touch one side of a given tile.

5.2.  Neighbor  Finding 

Another  common  searching  operation  is  neighbor  finding: 
find  all  the  tiles  that  touch  one  side  of  a  given  tile.  Neighbor 
finding  is  useful  for  design  rule  checking,  compaction,  circuit 
extraction,  and  tracing  out  connected  nets.  Fig.  7  illustrates 
how  to  find  all  the  tiles  that  touch  the  right  side  of  a  given 
tile: 

1)  Follow  the  tr  stitch  of  the  starting  tile  to  find  its  topmost 
right  neighbor. 

2)  Then  trace  down  through  lb  stitches  until  all  the  neighbors 
have  been  found  (the  last  neighbor  is  the  first  tile  encountered 
whose  lower  y  coordinate  is  less  than  or  equal  to  the  lower 
y  coordinate  of  the  starting  tile). 
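A minimal sketch of the right-side search follows, together with the mirrored search for the left side. The Tile class, the brute-force stitch helper, and the example plane are hypothetical scaffolding for the illustration; the plane is finite, so stitches at its border are None and a tile on the border simply has no neighbors on that side.

```python
class Tile:
    """Half-open rectangle [x0, x1) x [y0, y1) with four corner stitches."""

    def __init__(self, name, x0, y0, x1, y1, solid=False):
        self.name, self.solid = name, solid
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.rt = self.tr = self.lb = self.bl = None


def stitch(tiles):
    """Set all stitches by brute force (for small examples only)."""
    def at(x, y):
        return next((t for t in tiles
                     if t.x0 <= x < t.x1 and t.y0 <= y < t.y1), None)
    for t in tiles:
        t.rt = at(t.x1 - 1, t.y1)   # rightmost top neighbor
        t.tr = at(t.x1, t.y1 - 1)   # topmost right neighbor
        t.lb = at(t.x0, t.y0 - 1)   # leftmost bottom neighbor
        t.bl = at(t.x0 - 1, t.y0)   # bottommost left neighbor


def example_plane():
    """A hypothetical 10 x 9 plane: four solid tiles, five space strips."""
    spec = [("X", 0, 0, 4, 9, True), ("U", 4, 8, 10, 9), ("P", 4, 6, 7, 8, True),
            ("P2", 4, 4, 7, 6, True), ("V", 7, 4, 10, 8), ("Y", 4, 2, 6, 4),
            ("Q", 6, 2, 9, 4, True), ("Z", 9, 2, 10, 4), ("B", 4, 0, 10, 2)]
    tiles = {s[0]: Tile(*s) for s in spec}
    stitch(list(tiles.values()))
    return tiles


def right_neighbors(tile):
    """Tiles touching the right side of tile, from top to bottom."""
    found = []
    n = tile.tr                      # step 1: topmost right neighbor
    while n is not None:
        found.append(n)
        if n.y0 <= tile.y0:          # step 2: this is the last neighbor
            break
        n = n.lb                     # next neighbor down
    return found


def left_neighbors(tile):
    """Mirror image: tiles touching the left side, from bottom to top."""
    found = []
    n = tile.bl                      # bottommost left neighbor
    while n is not None:
        found.append(n)
        if n.y1 >= tile.y1:
            break
        n = n.rt                     # next neighbor up
    return found


tiles = example_plane()
right_of_x = [t.name for t in right_neighbors(tiles["X"])]
left_of_v = [t.name for t in left_neighbors(tiles["V"])]
```

Each loop iteration follows exactly one stitch, so the cost is one stitch traversal per neighbor reported.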

Similar algorithms can be devised to search each of the other
sides. The time for the search is linear in the number of neighbors.
As shown in Appendix A, the expected number of neighbors
is one or two along each side. In layouts where tile sizes
vary greatly, the number of neighbors will, on average, be
proportional to the length of the side.

5.3.  Area  Searches 

A third form of searching is to see if there are any solid tiles
within a given area. This can be accomplished in the following
manner using corner stitches (see Fig. 8):

1) Use the point-finding algorithm to locate the tile containing
the upper left corner of the area of interest.

2) See if the tile is solid. If not, it must be a space tile. See
if its right edge is within the area of interest. If so, it is the edge
of a solid tile.

3) If a solid tile was found in step 2), then the search is complete.
If no solid tile was found, then move down to the next
tile touching the left edge of the area of interest. This can
be done either by invoking the point-finding algorithm, or by
traversing the lb stitch down and then traversing tr stitches
right until the desired tile is found.

4) Repeat steps 2) and 3) until either a solid tile is found or
the bottom of the area of interest is reached.
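The area search can be sketched as below. The sketch takes the first option in step 3) and re-invokes point finding with the current tile as the starting point. The Tile class, the brute-force stitch helper, the example plane, and the function names are hypothetical; the search area is assumed to lie inside a finite plane whose space tiles are maximal horizontal strips.

```python
class Tile:
    """Half-open rectangle [x0, x1) x [y0, y1) with four corner stitches."""

    def __init__(self, name, x0, y0, x1, y1, solid=False):
        self.name, self.solid = name, solid
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.rt = self.tr = self.lb = self.bl = None


def stitch(tiles):
    """Set all stitches by brute force (for small examples only)."""
    def at(x, y):
        return next((t for t in tiles
                     if t.x0 <= x < t.x1 and t.y0 <= y < t.y1), None)
    for t in tiles:
        t.rt = at(t.x1 - 1, t.y1)   # rightmost top neighbor
        t.tr = at(t.x1, t.y1 - 1)   # topmost right neighbor
        t.lb = at(t.x0, t.y0 - 1)   # leftmost bottom neighbor
        t.bl = at(t.x0 - 1, t.y0)   # bottommost left neighbor


def example_plane():
    """A hypothetical 10 x 9 plane: four solid tiles, five space strips."""
    spec = [("X", 0, 0, 4, 9, True), ("U", 4, 8, 10, 9), ("P", 4, 6, 7, 8, True),
            ("P2", 4, 4, 7, 6, True), ("V", 7, 4, 10, 8), ("Y", 4, 2, 6, 4),
            ("Q", 6, 2, 9, 4, True), ("Z", 9, 2, 10, 4), ("B", 4, 0, 10, 2)]
    tiles = {s[0]: Tile(*s) for s in spec}
    stitch(list(tiles.values()))
    return tiles


def find_point(start, x, y):
    """Point finding: alternate vertical and horizontal stitch moves."""
    t = start
    while not (t.x0 <= x < t.x1 and t.y0 <= y < t.y1):
        while y >= t.y1:
            t = t.rt
        while y < t.y0:
            t = t.lb
        while x >= t.x1:
            t = t.tr
        while x < t.x0:
            t = t.bl
    return t


def area_has_solid(hint, ax0, ay0, ax1, ay1):
    """True if any solid tile intersects [ax0, ax1) x [ay0, ay1)."""
    t = find_point(hint, ax0, ay1 - 1)        # step 1: upper left corner
    while True:
        if t.solid:                           # step 2: solid tile on the edge
            return True
        if t.x1 < ax1:                        # space tile ends inside the area,
            return True                       # so a solid tile abuts its right edge
        if t.y0 <= ay0:                       # step 4: bottom of the area reached
            return False
        t = find_point(t, ax0, t.y0 - 1)      # step 3: next tile down the left edge


tiles = example_plane()
empty = area_has_solid(tiles["X"], 7, 4, 10, 9)
occupied = area_has_solid(tiles["X"], 4, 0, 10, 4)
```

The test in step 2) relies on the strip property: a space tile never has another space tile to its right, so a right edge that falls inside the area must belong to a solid tile.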

Fig. 8. To search a rectangular area for a solid tile, work down along the
left edge of the area. Each tile along the edge must be either a solid
tile, a space tile that spans the entire area, or a space tile with a solid
tile just to its right.

As with the other operations, the time necessary for this
operation depends only on local features: the number of tiles
in  and  around  the  area  of  interest.  The  cost  can  be  measured 
by  counting  the  number  of  stitches  that  must  be  traversed. 
The  number  of  iterations  through  the  algorithm  will  be  pro¬ 
portional  to  the  height  of  the  area  (assuming,  as  always,  a  rela¬ 
tively  uniform  size  distribution).  In  each  iteration,  it  may  be 
necessary  to  traverse  one  stitch  in  step  2).  In  addition,  step  3) 
will  cause  a  misalignment  of  about  1/2  tile  in  the  average  case. 
Thus the total running time is linear in the height of the search
area, and does not depend at all on the width of the search
area. In worst-case situations like the one shown in Fig. 9(a),
misalignments  could  cause  the  running  time  to  be  proportional 
to  the  total  number  of  tiles  in  the  layout. 

5.4. Directed Area Enumeration

The algorithm in Section 5.3 determines if there are any solid
tiles in an area. However, for many applications, such as compaction
and layout rule checking, it is useful to enumerate all
the tiles in a given area, i.e., to “visit” each tile exactly once.
Furthermore, it is often useful to do this in a particular direction.
For example, during a left-to-right compaction, it is important
that a tile not be processed until all tiles on its left have
been processed. This section presents an algorithm wherein
each tile is visited only after all the tiles above it and to its
left have been visited. I call such an enumeration a directed
enumeration. Corner stitching makes this a linear time operation.
Fig. 10 shows the enumeration order for an example
case.

1) As for the area-searching algorithm, use the point-finding
algorithm to locate the tile at the top left corner of the area of
interest. Then step down through all the tiles along the left
edge, using the same technique as in area searching.

2) For each tile found in step 1), enumerate it recursively
using the R procedure given in lines R1) through R5).

R1) Enumerate the tile (this will generally involve some
application-specific processing).

R2) If the right edge of the tile is outside of the search
area, then return from the R procedure.

R3) Otherwise, use the neighbor-finding algorithm to locate
all the tiles that touch the right side of the current tile
and also intersect the search area.

R4) For each of these neighbors, if the bottom left corner
of the neighbor touches the current tile then call R to enumerate
the neighbor recursively (for example, this occurs in
Fig. 10 when tile 1 is the current tile and tile 2 is the neighbor).

R5) Or, if the bottom edge of the search area cuts both
the current tile and the neighbor, then call R to enumerate
the neighbor recursively (in Fig. 10, this occurs when tile 8
is the current tile and tile 9 is the neighbor).

Fig. 9. Two pathological structures. In (a), area searches of the dashed
area are slow because of severe misalignment during step 3) of the
algorithm. It is also slow to create a tile in the dashed area at (a),
which produces the situation in (b), or delete the labeled tile in (b)
to get back the situation in (a): when splitting and merging space
tiles, corner stitches must be modified in every solid tile in the circuit.

Fig. 10. An example of directed enumeration. When doing an upper
left to lower right enumeration of the dashed area, the tiles will be
visited in order of their numbers.
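The enumeration procedure can be sketched as follows. The Tile class, the brute-force stitch helper, the example plane, and the function names are hypothetical, and the way R4) and R5) are expressed as coordinate comparisons is one interpretation for half-open integer tiles, not the paper's code. The search area is assumed to lie inside a finite plane.

```python
class Tile:
    """Half-open rectangle [x0, x1) x [y0, y1) with four corner stitches."""

    def __init__(self, name, x0, y0, x1, y1, solid=False):
        self.name, self.solid = name, solid
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.rt = self.tr = self.lb = self.bl = None


def stitch(tiles):
    """Set all stitches by brute force (for small examples only)."""
    def at(x, y):
        return next((t for t in tiles
                     if t.x0 <= x < t.x1 and t.y0 <= y < t.y1), None)
    for t in tiles:
        t.rt = at(t.x1 - 1, t.y1)   # rightmost top neighbor
        t.tr = at(t.x1, t.y1 - 1)   # topmost right neighbor
        t.lb = at(t.x0, t.y0 - 1)   # leftmost bottom neighbor
        t.bl = at(t.x0 - 1, t.y0)   # bottommost left neighbor


def example_plane():
    """A hypothetical 10 x 9 plane: four solid tiles, five space strips."""
    spec = [("X", 0, 0, 4, 9, True), ("U", 4, 8, 10, 9), ("P", 4, 6, 7, 8, True),
            ("P2", 4, 4, 7, 6, True), ("V", 7, 4, 10, 8), ("Y", 4, 2, 6, 4),
            ("Q", 6, 2, 9, 4, True), ("Z", 9, 2, 10, 4), ("B", 4, 0, 10, 2)]
    tiles = {s[0]: Tile(*s) for s in spec}
    stitch(list(tiles.values()))
    return tiles


def find_point(start, x, y):
    """Point finding: alternate vertical and horizontal stitch moves."""
    t = start
    while not (t.x0 <= x < t.x1 and t.y0 <= y < t.y1):
        while y >= t.y1:
            t = t.rt
        while y < t.y0:
            t = t.lb
        while x >= t.x1:
            t = t.tr
        while x < t.x0:
            t = t.bl
    return t


def enumerate_area(hint, ax0, ay0, ax1, ay1, visit):
    """Call visit(tile) once for each tile meeting [ax0, ax1) x [ay0, ay1)."""
    def R(t):
        visit(t)                                        # R1
        if t.x1 >= ax1:                                 # R2
            return
        n = t.tr                                        # R3: right neighbors
        while n is not None:
            if n.y0 < ay1 and n.y1 > ay0:               # neighbor meets the area
                corner_touches = n.y0 >= t.y0           # R4
                bottom_cuts_both = t.y0 <= ay0 and n.y0 <= ay0   # R5
                if corner_touches or bottom_cuts_both:
                    R(n)
            if n.y0 <= t.y0:
                break
            n = n.lb

    t = find_point(hint, ax0, ay1 - 1)                  # step 1: top left corner
    while True:
        R(t)                                            # step 2
        if t.y0 <= ay0:
            break
        t = find_point(t, ax0, t.y0 - 1)                # next tile down the left edge


tiles = example_plane()
order = []
enumerate_area(tiles["X"], 4, 0, 10, 9, lambda t: order.append(t.name))
```

In this example tile V has two left neighbors, P and P2. The check made from P fails, so V is visited only from P2, after both of its left neighbors have been enumerated.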

The expected running time of the directed enumeration algorithm
is linear in the number of tiles intersecting the search
area. This can be shown by the following arguments. The
checks in steps R4) and R5) guarantee that each tile is enumerated
exactly once. However, a tile may be checked several
times before satisfying the checks in step R4) or R5): it will be
checked once for each tile that touches its left side. The total
expected running time of the algorithm is thus proportional to
the total number of adjacencies within the search area. Appendix A
uses the properties of planar graphs to prove that the
number of adjacencies must be linear in the number of tiles.

In  the  worst  case,  directed  area  enumeration  could  require 
every  tile  in  the  circuit  to  be  examined.  This  happens  if  tiles 
stick  out  far  above  the  top  edge  of  the  area  being  enumerated: 
all  of  their  neighbors  must  be  enumerated  in  step  R3),  even 
though  most  of  them  do  not  intersect  the  area  of  interest. 

The  algorithm  for  directed  enumeration  does  not  depend  on 
the  fact  that  space  tiles  are  maximal  horizontal  strips.  In  fact, 
it  does  not  even  distinguish  between  solid  and  space  tiles.  A 
similar  algorithm  can  be  devised  to  reverse  the  direction  of 
enumeration  (from  lower  right  to  upper  left).  But,  it  is  much 
more  difficult  to  recode  the  algorithm  to  operate  from  lower 
left  to  upper  right,  or  from  upper  right  to  lower  left  (this  is 
because  there  are  no  corner  stitches  emanating  from  the  lower 
right  or  upper  left  corners  of  tiles). 

5.5  Tile  Creation 

The  first  step  in  creating  a  new  solid  tile  is  to  check  to  see 
that  there  are  no  existing  solid  tiles  in  the  desired  area  of  the 
new  tile.  The  area-search  algorithm  can  check  this.  The  sec¬ 
ond  step  is  to  insert  the  tile  into  the  data  structure,  clipping 
and  merging  space  tiles  and  updating  corner  stitches  as  shown 
in  Fig.  11.  The  insertion  algorithm  is  as  follows: 

1)  Find  the  space  tile  containing  the  top  edge  of  the  area 
to  be  occupied  by  the  new  tile  (because  of  the  strip  property, 
a  single  space  tile  must  contain  the  entire  edge). 

2)  Split  the  top  space  tile  along  a  horizontal  line  into  a  piece 
entirely  above  the  new  tile  and  a  piece  overlapping  the  new 
tile.  Update  corner  stitches  in  the  tiles  adjoining  the  new  tile. 

3)  Find  the  space  tile  containing  the  bottom  edge  of  the 
new  solid  tile,  split  it  in  the  same  fashion,  and  update  stitches 
around  it. 

4)  Work  down  along  the  left  side  of  the  area  of  the  new 
tile,  as  for  the  area-search  algorithm.  Each  tile  along  this  edge 
must  be  a  space  tile  that  spans  the  entire  width  of  the  new 
solid  tile.  Split  the  space  tile  into  a  piece  entirely  to  the  left 
of  the  new  tile,  a  piece  entirely  to  the  right  of  the  new  tile, 
and  a  piece  entirely  within  the  new  tile.  This  splitting  may 
make  it  possible  to  merge  the  left  and  right  remainders  verti¬ 
cally  with  the  tiles  just  above  them:  merge  whenever  possible. 
Finally,  merge  the  center  space  tile  with  the  solid  tile  that  is 
forming.  Each  split  or  merge  requires  stitches  to  be  updated 
in  adjoining  tiles. 

The  speed  of  the  creation  algorithm  is  determined  by  the 
cost  of  splitting  and  merging  the  space  tiles  that  cross  the  area. 
The  number  of  space  tiles  depends  on  the  number  of  solid 
tiles  in  the  left  and  right  shadows  of  the  new  tile.  One  can  de¬ 
vise  cases  where  the  number  of  space  tiles  is  arbitrarily  high, 
but in practice, the expected number is proportional to the
relative height of the new tile in comparison to the tiles around
it. Appendix B discusses the cost of splitting and merging
tiles. In the average case it is constant; for very large tiles it
is proportional to the circumference of the tile. This means
that in the worst possible case, the cost of creating a new tile
could be proportional to the total number of tiles in the layout
(see Fig. 9(a)). In the average case, the running time is constant
if the new tile is about the same size as the tiles around
it; if the new tile is much larger than its neighbors, then the
running time is proportional to the height of the new tile and
independent of its width.

Fig. 11. Inserting a new solid tile into the data structure. (a) shows the desired location of the new tile. In (b) the space
tiles containing the top and bottom edges of the new solid tile are split. In (c) and (d) the area of the new tile is traversed
from top to bottom, splitting and joining space tiles on either side and pointing their stitches at the new solid tile.

5.6  Tile  Deletion 

Tile  deletion  is  complicated  by  the  need  to  split  and  merge 
space  tiles  so  as  to  maintain  the  horizontal-strip  representa¬ 
tion.  The  algorithm  below  works  in  a  mostly  clockwise  fash¬ 
ion  around  the  tile  being  deleted,  which  is  referred  to  as  the 
dead  tile.  See  Fig.  12  for  an  example. 

1) Change the type of the dead tile to “space”.

2)  Use  the  neighbor-finding  algorithm  to  search  from  top  to 
bottom  through  all  the  tiles  that  adjoin  the  right  edge  of  the 
dead  tile. 

3)  For  each  space  tile  found  in  step  2),  split  either  the 
neighbor  or  the  dead  tile,  or  both,  so  that  the  two  tiles  have 
the  same  vertical  span,  then  merge  the  tiles  together  horizon¬ 
tally. 

4)  When  the  bottom  edge  of  the  original  dead  tile  is  reached, 
scan  upwards  along  the  left  edge  of  the  original  dead  tile  to 
find  all  the  space  tiles  that  are  left  neighbors  of  the  original 
dead  tile. 

5)  For  each  space  tile  found  in  step  4),  merge  the  space  tile 
with  the  adjoining  remains  of  the  original  dead  tile.  Do  this 
by  repeating  steps  2)-3),  treating  the  current  space  tile  like  the 
dead  tile  in  steps  2)-3). 

6) It is also necessary to do vertical merging in step 5).
After each horizontal merge in step 5), check to see if the result
tile can be merged with the tiles just above and below it,
and merge if possible.

As  with  the  other  algorithms,  deletion  could  require  a  great 
deal  of  time  in  pathological  cases.  For  example,  Fig.  9(b) 
shows  a  situation  where  corner  stitches  will  have  to  be  exam¬ 
ined  and  modified  in  every  single  tile  in  the  layout,  so  running 
time  will  be  proportional  to  the  overall  layout  size.  However, 
situations  like  this  are  not  likely  in  integrated  circuits.  If  the 
tiles  are  roughly  uniform  in  size  and  distribution,  then  the 
number  of  splits  and  joins  will  be  constant  and  the  running 
time  will  also  be  constant.  When  a  large  tile  is  being  deleted, 
the  running  time  will  be  proportional  to  the  number  of  left 
and  right  neighbors  of  the  tile,  which  is  proportional  to  the 
tile’s  height. 

5.7 Plowing

Plowing  is  an  example  of  an  important  operation  that  can¬ 
not  easily  be  implemented  with  most  existing  data  structures. 
When  one  piece  of  a  large  design  is  moved,  it  is  often  desirable 
for  other  pieces  of  the  design  lying  in  the  path  of  motion  to 
move  as  well,  as  if  the  original  piece  were  a  plow.  Ideally,  such 
a  motion  will  stretch  or  shrink  the  design  while  maintaining 
design  rules  and  connectivity.  Plowing  can  be  accomplished 
with  corner  stitching  in  the  following  way: 

1)  Determine  the  rectangular  area  that  will  be  swept  out  by 
the  motion  of  the  original  tile  (see  Fig.  13). 

2) Use the area-search algorithm to see if there are any
solid  tiles  in  the  plow  area.  If  a  solid  tile  is  found,  invoke  the 
plow  algorithm  recursively  to  move  the  tile  out  of  the  plow 
area.  Repeat  this  step  until  no  solid  tiles  are  found. 

3)  Delete  the  original  tile  from  its  old  location  and  create 
it  at  the  new  position. 
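The three steps of this simple plowing algorithm can be sketched without the tile plane. The sketch is a simplification under stated assumptions: solid tiles are kept in a dictionary of (x0, y0, x1, y1) tuples, a linear scan stands in for the corner-stitched area search, only rightward plowing is handled, and no design-rule spacing is maintained. The function name plow_right is hypothetical.

```python
def plow_right(rects, name, dx):
    """Move rects[name] right by dx, pushing other rectangles out of the way.

    rects maps a name to a non-overlapping (x0, y0, x1, y1) rectangle and is
    updated in place.
    """
    x0, y0, x1, y1 = rects[name]
    while True:
        # Steps 1 and 2: look for a solid in the swept area [x1, x1 + dx) x [y0, y1).
        hit = next((n for n, (a0, b0, a1, b1) in rects.items()
                    if n != name
                    and a0 < x1 + dx and a1 > x1
                    and b0 < y1 and b1 > y0), None)
        if hit is None:
            break
        # Recursively plow the obstacle just far enough to clear the area.
        plow_right(rects, hit, x1 + dx - rects[hit][0])
    # Step 3: delete the tile at its old location and create it at the new one.
    rects[name] = (x0 + dx, y0, x1 + dx, y1)


layout = {
    "a": (0, 0, 2, 2),
    "b": (3, 0, 5, 2),
    "c": (6, 1, 8, 3),
    "d": (6, 5, 8, 7),   # outside the path of motion; stays put
}
plow_right(layout, "a", 3)
```

Plowing a by three units pushes b by two and c by one, while d is untouched. Each obstacle is moved as soon as it is found, which is why a tile may be moved repeatedly in unfavorable structures.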
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Fig.  12.  An  example  of  tQe  deletion.  In  each  figure,  tile  X  is  the  next  one  to  be  processed.  (a)  Shows  the  initial  tile  ar¬ 
rangement  and  the  clockwise  order  in  which  stitches  will  be  traversed  around  the  dead  tQe  to  merge  it  with  adjacent 
space  tiles.  In  (b)  the  downward  sweep  along  the  right  edge  has  been  completed  (note  that  the  left  edge.of  the  dead 
tile  is  stdl  intact).  In  (c)  the  upward  sweep  along  the  left  edge  is  partially  complete,  and  (d)  shows  the  final  situation. 


(c) 


Fig. 13. An example of plowing: (a) determine the area to be swept
out by the motion; (b) recursively move all solid tiles out of this
area; and (c) move the original tile.


Fig. 14. Using a top-to-bottom area search with the simple plowing
algorithm, this structure will cause the rightmost tiles to be moved
many times when the cross-hatched tile is plowed to the right. Total
running time will be exponential in the circuit size.


Unfortunately, this simple algorithm suffers from terrible
worst-case behavior. Lattice structures like the one in Fig. 14
can require up to 2^N recursive tile moves to clear N tiles out
of the plow area. It seems likely that structures similar to the
one in Fig. 14 may occur in actual circuits. Fortunately, the
algorithm can be made to run in linear expected time by ordering
the recursive processing so that a tile is not moved until its
final position is known (i.e., it is not processed until all the
tiles that can affect its final position have been processed).
The code is somewhat complex, and is different for horizontal
plowing than for vertical plowing. Appendixes C and D develop
the linear-time algorithm in detail. In the worst case,
the algorithms of Appendixes C and D could require MN time,
where M is the total number of tiles that have to be moved


Fig. 15. To compact a layout horizontally, plow a large additional
tile (cross-hatched in the figure) across the layout: (a) shows the
configuration before the plow, and (b) shows the compacted
configuration afterwards. The tile acts like a broom and compacts as
it sweeps.

and  N  is  the  size  of  the  circuit.  In  the  average  case  they  re¬ 
quire  time  linear  in  M. 

5.8  Compaction 

Most existing algorithms for compaction require N^2 time in
the  worst  case  for  a  layout  containing  N  elements,  and  have 
been  empirically  observed  to  have  average  running  time  close 
to N^1.2 [8]. With corner stitching, compaction is linear in
the  size  of  the  layout.  Compaction  in  a  single  direction  can  be 
achieved  in  a  simple  way  by  plowing  a  large  tile  across  the  lay¬ 
out,  as  shown  in  Fig.  15.  The  linear  expected  time  for  plowing 
results in linear expected time for compaction. The worst-case
compaction time is still N^2 using corner stitching.

There  are  two  keys  to  the  speed  of  compaction  in  corner 
stitching.  The  first,  and  most  important,  is  that  all  the  de¬ 
pendencies  between  tiles  are  maintained  dynamically.  In  other 
compaction  systems,  the  dependencies  must  be  reconstructed 
after  each  change  to  the  layout;  the  algorithms  for  generating 
dependencies  limit  the  overall  speed  of  compaction.  The  sec¬ 
ond  key  is  that  the  layout  is  planar.  This  means  that  the  num¬ 
ber  of  adjacencies  is  linear  in  the  number  of  tiles,  and  hence, 
the  whole  layout  can  be  scanned  in  time  proportional  to  the 
number  of  tiles. 

5.9  Channel  Finding 

Channel information is constantly available in the form of
the  space  tiles.  The  corner  stitches  make  it  possible  to  find 
connected  channels  and  thereby  trace  out  signal  paths.  Of 


TABLE I

Corner stitching requires about 40 percent more storage per tile than
linked-list systems like Caesar. Only the lower and left coordinates of
each tile need be stored in corner stitching, since the upper and right
coordinates can be gotten by examining the lower and left coordinates
of neighboring tiles.

               Caesar               Corner Stitching

Coordinates    x1, y1, x2, y2       x1, y1
               (16 bytes)           (8 bytes)

Pointers       1 link               4 stitches
               (4 bytes)            (16 bytes)

Tile Type      not needed           (4 bytes)

Total          20 bytes             28 bytes

course,  some  routers  may  prefer  a  different  representation  of 
channels  than  maximal  horizontal  strips;  if  this  is  the  case, 
then  conversion  will  be  necessary  to  cast  the  space  tiles  into  a 
form  suitable  for  routing. 

VI.  Space  Requirements 

Because  of  the  enormous  size  of  VLSI  designs,  a  data  struc¬ 
ture  used  for  VLSI  CAD  must  be  space  efficient  if  it  is  to  be 
effective.  For  example,  the  hierarchical  representation  of  a 
45 000-transistor chip requires about 1.5 X 10^6 bytes of main
memory  in  Caesar.  Corner  stitching  requires  more  information 
to  be  kept  in  the  data  structure  than  systems  like  Caesar. 
Table  1  compares  corner  stitching  to  the  linked-list  scheme  of 
Caesar.  Corner  stitching  requires  three  more  pointers  than 
Caesar, plus a type field (in linked-list systems all the tiles on
a  given  list  are  of  the  same  type).  Corner  stitching  saves  space 
by  storing  only  the  lower  and  left  coordinates  of  each  tile,  in¬ 
stead  of  four  coordinates;  the  upper  and  right  coordinates  of 
a  tile  can  be  gotten  from  the  lower  and  left  coordinates  of 
neighboring  tiles.  As  a  result,  corner-stitched  tiles  are  about 
40 percent larger than Caesar tiles. In addition, there are
many more tiles in corner stitching than in other systems since
corner stitching requires empty space to be represented. If
there are many space tiles, then corner stitching will require
too  much  space  to  be  practical.  Furthermore,  most  of  the 
algorithms  depend  on  the  total  number  of  tiles  in  an  area,  in¬ 
cluding  both  space  and  solid  tiles;  if  there  are  many  space  tiles, 
the  algorithms  will  be  inefficient. 
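
As a rough illustration of how a tile can store only its lower-left corner, the sketch below derives the right and top coordinates from neighboring tiles through two of the four stitches. The class and attribute names are assumptions made for this example (the stitch names `rt`, `tr`, `bl`, and `lb` follow the corner-naming style used elsewhere in the text), and boundary tiles without neighbors are not handled.

```python
class Tile:
    """A corner-stitched tile that stores only its lower-left corner."""

    def __init__(self, x, y, solid=False):
        self.x = x          # left coordinate
        self.y = y          # bottom coordinate
        self.solid = solid  # tile type: solid or space
        self.rt = None      # stitch to the tile above, at the top-right corner
        self.tr = None      # stitch to the tile to the right, at the top-right corner
        self.bl = None      # stitch to the tile to the left, at the bottom-left corner
        self.lb = None      # stitch to the tile below, at the bottom-left corner

    @property
    def right(self):
        # The right edge coincides with the left edge of the right neighbor.
        return self.tr.x

    @property
    def top(self):
        # The top edge coincides with the bottom edge of the tile above.
        return self.rt.y

# Hypothetical arrangement: t has one neighbor to the right and one above.
t = Tile(0, 0, solid=True)
right_neighbor = Tile(5, 0)
upper_neighbor = Tile(0, 3)
t.tr, t.rt = right_neighbor, upper_neighbor
```

Here `t.right` evaluates to 5 and `t.top` to 3, even though neither value is stored in `t` itself.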

In  a  circuit  with  N  solid  tiles,  there  will  never  be  more  than 
3N + 1 space tiles. Furthermore, the horizontal-strip representation
is at least as efficient (in the worst case) as any other
rectangle-based  representation  of  space.  In  actual  circuit  lay¬ 
outs,  the  number  of  space  tiles  is  about  equal  to  the  number 
of  solid  tiles. 

The proof of the 3N + 1 upper limit is due to C. Sequin. To
see that no more than 3N + 1 space tiles are needed for N solid
tiles, place the solid tiles one at a time in order from right to
left as shown in Fig. 16. Initially there is a single space tile.
When each solid tile is placed, it can result in no more than
three new space tiles; the top and bottom edges may each
cause a space tile to be split, and a new space tile will be
created in the shadow to the left of the solid tile. Because
we place the solid tiles in order, there can be no solid tiles
in the shadow. This means that only a single space tile will
be created there. Although the solid tiles were placed in a
particular order to demonstrate the 3N + 1 limit, the final
configuration is independent of the order in which the tiles

Fig. 17. In pathological situations where no two solid tiles have colinear
edges, at least 3N + 1 tiles must be used to represent space,
regardless of whether or not horizontal strips are used.


TABLE II

For actual layouts, corner stitching requires about one space tile for
each solid tile. The first case consists of all the global routing for the
RISC I microprocessor (i.e., all the rectangles in the topmost cell of
the hierarchy). The routing is sparse. The second and third cases
consist of cells of another microprocessor under development.


Fig. 16. Both (a) and (b) show that if solid tiles are inserted in order
from right to left, each tile causes no more than three additional
space tiles to be created. However, if edges of the new tile align with
edges of old tiles, as in (c), fewer than three additional space tiles will
be required.

are  placed  (the  horizontal-strip  property  guarantees  this). 
Thus  the  result  is  valid  regardless  of  the  order  of  solid-tile 
creation. 
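
The 3N + 1 bound can be checked on small examples with the sketch below. It is not the incremental tile-creation algorithm of the paper: it computes the maximal-horizontal-strip space tiles directly from a list of solid rectangles by slicing the layout at every solid top and bottom edge. The function name `space_tiles` and the tuple layout `(x1, y1, x2, y2)` are assumptions, and the solid tiles are assumed to be non-overlapping and to lie inside the bounding box.

```python
def space_tiles(solids, box):
    """Return the maximal-horizontal-strip space tiles inside box.

    solids: non-overlapping (x1, y1, x2, y2) rectangles lying inside box.
    """
    bx1, by1, bx2, by2 = box
    ys = sorted({by1, by2} | {y for (_, y1, _, y2) in solids for y in (y1, y2)})
    growing = {}  # (x1, x2) -> bottom y of a strip that is still growing upward
    done = []
    for lo, hi in zip(ys, ys[1:]):
        # Solid x-intervals crossing the slab lo..hi, from left to right.
        blocked = sorted((x1, x2) for (x1, y1, x2, y2) in solids
                         if y1 < hi and y2 > lo)
        free, x = [], bx1
        for x1, x2 in blocked:
            if x1 > x:
                free.append((x, x1))
            x = x2
        if x < bx2:
            free.append((x, bx2))
        # A strip keeps growing only if exactly the same x-extent is free again.
        still = {iv: growing.pop(iv, lo) for iv in free}
        done += [(x1, y0, x2, lo) for (x1, x2), y0 in growing.items()]
        growing = still
    done += [(x1, y0, x2, by2) for (x1, x2), y0 in growing.items()]
    return done

one = space_tiles([(2, 2, 4, 4)], (0, 0, 6, 6))
staggered = space_tiles([(1, 1, 2, 3), (4, 2, 5, 4)], (0, 0, 6, 5))
aligned = space_tiles([(1, 1, 2, 3), (2, 1, 3, 3)], (0, 0, 4, 4))
```

With one solid tile the result has 4 space tiles, and with two solid tiles that share no colinear edges it has 7, matching 3N + 1 in both cases. When the two solid tiles abut with aligned edges, only 4 space tiles are needed.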

There  are  many  other  ways  to  organize  space  tiles  besides 
maximal horizontal strips. However, in the worst case, no
representation of space can use fewer than 3N + 1 space tiles.
This  worst  case  occurs  when  no  two  solid  tiles  have  colinear 
edges.  Fig.  17  shows  one  such  situation. 

Substantially fewer than 3N + 1 space tiles are needed
for actual VLSI applications. Fewer space tiles are needed
whenever edges of neighboring solid tiles align. For example,
Fig. 16(c) shows a situation where the placement of a solid
tile only adds two space tiles instead of three. In integrated
circuits, the solid tiles must touch each other to achieve electrical
connectivity, so the number of space tiles actually
needed is much less than 3N. Table II shows sample data


Circuit        Solid Tiles        Space Tiles
gathered from three cells using a layout editor based on corner
stitching.  On  the  average,  about  one  space  tile  is  required  for 
each  solid  tile.  This  means  that  the  total  storage  required  for 
geometry  in  corner  stitching  will  be  between  two  and  a  half 
and  three  times  as  great  as  in  systems  like  Caesar.  This  result 
applies  even  when  the  mask  layers  are  sparse,  as  in  the  global 
routing  example. 
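
The "two and a half to three times" estimate follows from simple arithmetic, sketched below. The 4-byte size for every coordinate, pointer, and type field is an assumption chosen to be consistent with the 40 percent per-tile figure quoted above; the variable and function names are hypothetical.

```python
WORD = 4  # assumed bytes per coordinate, pointer, or type field

# Caesar: four coordinates plus one linked-list pointer.
caesar_tile = 4 * WORD + 1 * WORD
# Corner stitching: two coordinates, four stitches, one type field.
stitched_tile = 2 * WORD + 4 * WORD + 1 * WORD

per_tile_ratio = stitched_tile / caesar_tile

def storage_ratio(space_per_solid):
    """Corner-stitched bytes divided by Caesar bytes for the same solid tiles."""
    return (1 + space_per_solid) * stitched_tile / caesar_tile
```

With one space tile per solid tile, `storage_ratio(1.0)` gives 2.8, which falls inside the range stated in the text.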

VII.  Using  Corner  Stitching  for  Real  VLSI 
The  scheme  presented  here  must  be  extended  in  several  ways 
to  make  it  practical  for  real  integrated  circuits.  This  section 
presents  some  of  the  important  issues  and  discusses  possible  so¬ 
lutions.  To  date,  there  have  been  two  implementations  of 
corner  stitching.  A  toy  implementation  was  built  using  ex¬ 
actly  the  model  and  algorithms  of  this  paper,  in  order  to  test 
the  basic  viability  of  the  ideas.  About  1100  lines  of  C  code 
were  required  to  implement  all  the  algorithms,  including  com¬ 
paction,  and  for  small  test  cases  (100  tiles)  response  was  in¬ 
stantaneous  for  all  operations. 

As a result of the successful toy implementation, we have
undertaken  the  development  of  a  full-fledged  VLSI  layout 
editor  based  on  corner  stitching.  It  has  just  recently  become 
operational.  Although  stretching  and  compaction  have  not 
been  implemented  yet,  the  current  system  is  at  least  as  power¬ 
ful  as  its  predecessor,  Caesar,  and  is  being  used  by  chip  de¬ 
signers  at  the  University  of  California  at  Berkeley. 

The  first  generalization  of  the  simple  scheme  is  to  provide 
for  multiple  mask  layers.  There  are  several  ways  to  accom¬ 
plish  this.  One  alternative  is  to  permit  many  different  types 
of  solid  tiles,  one  type  for  each  possible  combination  of  mask 
layers.  Unfortunately,  this  scheme  will  result  in  enormous 
numbers  of  tiny  tiles  in  places  where  several  mask  layers  cross 
each  other.  Many  of  the  layer  crossings  are  not  relevant,  so 
the  fragmentation  of  the  tile  structure  wastes  space  unneces¬ 
sarily  (for  example,  it  doesn’t  matter  where  metal  crosses  poly¬ 
silicon  or  diffusion,  unless  there  are  contact  cuts  present). 
Another alternative is to keep a separate corner-stitched
“plane”  for  each  mask  layer.  This  scheme  will  be  relatively 
space  efficient,  but  will  require  frequent  cross-registration  be¬ 
tween  planes  during  operations  such  as  plowing  and  design- 
rule  checking  that  deal  with  layer  interactions. 

For  our  layout  editor  based  on  corner  stitching,  we  used  a 
combination  of  these  two  schemes.  The  polysilicon,  diffusion, 
and  implant  layers  are  kept  together  in  a  single  corner-stitched 
plane  with  different  types  of  solid  tiles  for  each  layer  combi¬ 
nation.  This  makes  sense  because  most  of  the  different  combi¬ 
nations  of  these  layers  are  distinct  electrically.  Each  metal 
layer  is  kept  in  its  own  corner-stitched  plane,  since  they  inter¬ 
act  only  weakly  with  each  other  and  with  the  rest  of  the  cir¬ 
cuit.  Because  contacts  provide  a  connection  between  layers, 
they  are  duplicated  in  each  of  the  planes  that  they  connect. 
Under  this  scheme,  the  corner-stitched  representation  corre¬ 
sponds  almost  exactly  to  the  electrical  circuit,  since  the  tran¬ 
sistors  (combinations  of  polysilicon  and  diffusion  and  im¬ 
plants)  are  represented  by  special  tile  types.  Furthermore, 
this  particular  division  of  mask  layers  among  planes  allows 
each  plane  to  be  design-rule  checked  independently. 

To  handle  hierarchical  designs,  our  layout  editor  keeps  a 
separate  set  of  tile  planes  for  each  cell  in  the  design.  An 
additional  corner-stitched  plane  per  cell  is  used  to  keep  track 
of  the  cell’s  subcells.  A  different  tile  type  is  used  in  this  plane 
for  each  distinct  subcell  or  overlap  area  between  subcells. 

Design-rule  checking  is  trivial  in  the  simple  model.  The  only 
design  rule  is  that  there  can  be  no  solid-tile  overlap;  this  con¬ 
dition  is  enforced  by  the  creation  and  plowing  routines.  In 
actual  IC  designs,  the  design  rules  will  include  more  complex 
spacing  and  separation  rules  that  are  different  for  different 
tile types. For the corner-stitched editor, we have implemented
a simple design-rule checker similar to Lyra [1], except
that it is edge based instead of corner based. It scans a
corner-stitched plane, generates constraints at each edge based on the
tile types on either side of the edge, and uses area enumeration
to check the constraints. To handle areas of overlap between
subcells,  the  design-rule  checker  extracts  information  from  the 
separate  planes  of  the  subcells  into  an  auxiliary  corner-stitched 
structure  and  then  checks  the  auxiliary  structure. 

The  plow  algorithm  is  also  affected  by  more  complex  design 


TABLE III

Typical and Worst-Case Running Times for the Algorithms

M refers to the number of tiles of direct interest to the algorithm (e.g.,
the number of tiles being enumerated in area enumeration, or the
number of tiles removed in plowing). N refers to the total number of
tiles in the circuit.

Algorithm                    Expected Time    Worst-case Time

Point Search                 sqrt(N)          N
Point Search (with hint)     constant         N
Neighbor Search              M                M
Area Search                  M                N
Directed Area Enumeration    M                N
Tile Creation                constant         N
Tile Deletion                constant         N
Plowing                      M                MN
Compaction                   N                N^2

rules, and must deal with connectivity as well. Although the
implementation of plowing is not yet complete, real VLSI design
rules appear to be accommodated by selectively expanding
the plow area to maintain proper spacings. For example, if
the metal-metal spacing must be three units, then, when plowing
a metal tile, all unrelated metal must be cleared from an
area three units larger on all sides than the area swept out by
the tile's motion. Connectivity appears to be handled by
selectively stretching or shrinking some tiles, rather than moving
them.
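
The spacing example above amounts to growing the swept rectangle by the spacing distance on every side. A minimal sketch, assuming a rightward plow and tiles given as `(x1, y1, x2, y2)` tuples (the function name `plow_area` is hypothetical):

```python
def plow_area(tile, dist, spacing=0):
    """Area that must be cleared when tile is plowed right by dist.

    tile is (x1, y1, x2, y2); spacing is the required separation from
    unrelated material on the same layer.
    """
    x1, y1, x2, y2 = tile
    # Area swept out by the right edge of the tile as it moves.
    sx1, sy1, sx2, sy2 = x2, y1, x2 + dist, y2
    # Expand the swept area by the spacing rule on all sides.
    return (sx1 - spacing, sy1 - spacing, sx2 + spacing, sy2 + spacing)

plain = plow_area((0, 0, 4, 2), 5)
metal = plow_area((0, 0, 4, 2), 5, spacing=3)
```

The expanded area overlaps the plowed tile itself on its left side; that is harmless here because only unrelated material needs to be cleared.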

In  some  industrial  environments,  the  Manhattan  restriction 
may  be  intolerable.  Where  this  is  the  case,  it  may  be  possible 
to accommodate 45° angles by using trapezoids instead of rectangles.
Degenerate trapezoids can be used to represent triangles.
We do not plan to implement non-Manhattan features
in  our  system,  since  in  our  environment,  the  Manhattan  re¬ 
striction  is  acceptable  (and  even  desirable,  since  it  tends  to 
simplify  designs  and  make  tools  run  two  to  ten  times  faster). 
The  Manhattan  design  style  seems  to  be  gaining  more  and 
more  acceptance  in  the  integrated  circuit  design  community 
as  a  whole.  For  example,  the  Caesar  editor,  which  is  Manhat¬ 
tan,  is  now  being  used  at  nearly  200  industrial  and  university 
sites. 

VIII.  Conclusion 

Corner  stitching  is  a  powerful  technique  for  representing 
geometrical  data.  Its  two  most  important  features  are  a)  it  re¬ 
presents  empty  space  explicitly,  and  b)  it  links  together  tiles 
of  various  types  at  their  corners.  These  two  features  make  it 
possible  to  implement  a  variety  of  important  operations  that 
operate  purely  locally.  The  efficiency  of  the  algorithms  de¬ 
pends  only  on  local  information  and  not  on  the  overall  circuit 
size.  The  database  can  be  modified  incrementally,  so  that  one 
portion  of  the  design  can  be  changed  without  invalidating  the 
pointer  information  of  any  other  piece  of  the  design.  Corner 
stitching is effective both for densely packed circuits and for
sparse  ones.  See  Table  III  for  a  summary  of  the  complexity 
of  the  various  algorithms. 

The  main  drawback  of  the  mechanism  is  that  it  requires 
approximately  three  times  as  much  storage  as  simple  mech¬ 
anisms.  Fortunately,  designers  tend  to  focus  their  attention 
on a small portion of a layout at any given time; since corner
stitching uses only local information, it will have good paging
behavior in a demand-paged environment.

Appendix A
Adjacencies 

The  running  time  for  several  of  the  algorithms  depends  on 
the  number  of  neighbors  an  individual  tile  has.  One  can  con¬ 
struct  situations  where  a  tile  has  an  arbitrarily  large  number  of 
neighbors,  so  it  is  not  possible  to  state  any  absolute  upper 
bounds.  However,  graph  theory  can  be  used  to  determine  the 
average  number  of  adjacencies.  In  any  connected  planar  graph 

    n - e + f = 1

where n is the number of nodes, e is the number of edges, and
f is the number of faces contained by the edges. A face corresponds
to a tile, a node to a corner of a tile, and an edge to a
distinct adjacency between two tiles. For T tiles, f = T. The
number of distinct nodes n can be at most 4T, but in the interior
of the tile structure, each corner of one tile must coincide
with at least one corner of another tile (a "T" structure).
Thus n ≤ 2T and the total number of adjacencies is

    e = n + f - 1 ≤ 3T - 1.

Note that at the outside of the structure there may be corners
that don't coincide with other corners, but for each of these
there is also at least one edge that doesn't represent an adjacency
(because there is no tile on the other side). Hence the
3T - 1 upper limit is not affected.

The 3T - 1 limit counts each adjacency only once for the
two  tiles  that  are  adjacent.  To  compute  the  number  of  neigh¬ 
bors  per  tile,  the  figure  must  be  doubled.  This  means  that  on 
the  average,  an  individual  tile  will  have  about  six  neighbors, 
or  about  one  or  two  on  each  side.  This  is  regardless  of  the 
arrangement  of  tiles.  Of  course,  if  there  are  many  tiles  of 
different  sizes,  the  large  tiles  may  have  many  more  than  six 
neighbors.  The  average  number  of  neighbors  of  a  tile  in  a 
situation  like  this  will  be  roughly  proportional  to  the  perim¬ 
eter  of  the  tile,  which  is  less  than  linear  in  its  area. 
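
The adjacency bound can be checked numerically on a small tiling. The sketch below counts adjacencies by brute force over all pairs of tiles, which is only meant to illustrate the bound and is not how a corner-stitched system would enumerate neighbors. The brick-wall tiling and the names `adjacent` and `brick_wall` are assumptions made for this example.

```python
from itertools import combinations

def adjacent(a, b):
    """True if rectangles a and b share an edge segment of positive length."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    side = (ax2 == bx1 or bx2 == ax1) and min(ay2, by2) > max(ay1, by1)
    stacked = (ay2 == by1 or by2 == ay1) and min(ax2, bx2) > max(ax1, bx1)
    return side or stacked

def brick_wall(width, rows):
    """Rows of 2-wide bricks, with odd rows offset by one unit."""
    tiles = []
    for r in range(rows):
        if r % 2 == 0:
            edges = list(range(0, width + 1, 2))
        else:
            edges = [0] + list(range(1, width, 2)) + [width]
        tiles += [(x1, r, x2, r + 1) for x1, x2 in zip(edges, edges[1:])]
    return tiles

tiles = brick_wall(8, 4)
T = len(tiles)
e = sum(adjacent(a, b) for a, b in combinations(tiles, 2))
avg_neighbors = 2 * e / T
```

For this 18-tile wall there are 38 adjacencies, below the limit 3T - 1 = 53, and each tile has a little over four neighbors on average.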

Appendix  B 

Splitting  and  Merging 

This  section  discusses  the  cost  of  splitting  one  tile  into  two 
adjacent  tiles,  or  merging  two  adjacent  tiles  into  a  single  tile. 
A  tile  can  be  split  into  two  tiles  as  follows: 

1)  Make  an  exact  copy  of  the  original  tile. 

2)  Update  the  coordinates  of  each  tile  to  reflect  the  split, 
and  set  the  tiles’  corner  stitches  to  refer  to  each  other. 

3)  Update  the  corner  stitches  in  tiles  that  are  now  adjacent 
to  the  new  tile.  To  do  this,  use  the  neighbor-finding  algorithm 
to  locate  the  neighbors  on  three  sides  of  the  original  tile,  then 
update  the  stitches  that  must  point  to  the  new  tile. 

The algorithm for merging two adjacent tiles into a single
larger tile is similar: stitches must be updated along three sides
of the tile that is eliminated.

The  cost  of  each  algorithm  consists  of  constant  factors 
(copying  a  tile  or  changing  an  x  or  y  coordinate)  and  the 
search  of  neighbors  on  three  sides.  Appendix  A  showed  that 
the  number  of  neighbors  was  constant  when  averaged  across  a 
whole design, but increases for those tiles that are much larger


Fig. 18. An example of visibility searching. From tile X, tiles A, B, C,
and D are visible to the right. Tile E is not visible from X. Tile D
has two distinct windows of visibility to X, one between A and B
and one below C. During a horizontal visibility search from X, the
pictured corner stitches will be traversed.

than  their  neighbors.  In  this  case,  the  average  number  of 
neighbors  will  be  approximately  proportional  to  the  perimeter 
of  the  tile.  Thus  the  cost  of  a  split  or  merge  is  constant  if  the 
tile  being  split  or  merged  is  about  the  same  size  as  its  neigh¬ 
bors.  If  the  tile  is  much  larger  than  its  neighbors,  then  the  cost 
increases  in  proportion  to  the  tile's  perimeter,  which  is  less 
than  linear  in  its  area. 

Appendix  C 
Visibility  Searching 

This  section  gives  algorithms  that  locate  all  solid  tiles  visible 
on  one  side  of  a  given  solid  tile.  Two  solid  tiles  are  mutually 
visible if it is possible to draw a horizontal or vertical line between
them without crossing any other solid tiles. Fig. 18
gives  examples  of  visible  and  invisible  tiles.  Visibility  searching 
is  used  during  compaction  and  stretching.  Unfortunately,  the 
horizontal-strip  representation  of  space  requires  different  al¬ 
gorithms  for  horizontal  and  vertical  searches. 
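
The definition of visibility can be made concrete with a brute-force check, sketched below for the rightward direction. This is deliberately not the stitch-following algorithm developed in this appendix: it scans every solid tile, so its cost is proportional to the circuit size. The function name `visible_right`, the tuple layout `(x1, y1, x2, y2)`, and the tile arrangement (loosely modeled on the situation described for Fig. 18) are all assumptions.

```python
def visible_right(x, solids):
    """Solid tiles visible to the right of tile x, found by brute force."""
    _, y1, x2, y2 = x
    # Candidates lie to the right of x and overlap its vertical extent.
    cands = [t for t in solids
             if t != x and t[0] >= x2 and t[1] < y2 and t[3] > y1]
    # Cut the vertical extent of x at every candidate edge; within one slab
    # the nearest candidate hides everything behind it.
    ys = sorted({y1, y2} |
                {y for t in cands for y in (t[1], t[3]) if y1 < y < y2})
    seen = []
    for lo, hi in zip(ys, ys[1:]):
        row = [t for t in cands if t[1] < hi and t[3] > lo]
        if row:
            nearest = min(row, key=lambda t: t[0])
            if nearest not in seen:
                seen.append(nearest)
    return seen

X = (0, 0, 2, 10)
A, B, C = (3, 7, 5, 10), (3, 4, 5, 6), (3, 1, 5, 3)
D = (7, 0, 9, 10)   # seen through the gaps between and below A, B, and C
E = (6, 7, 7, 9)    # hidden behind A
found = visible_right(X, [X, A, B, C, D, E])
```

In this arrangement A, B, C, and D are reported as visible and E is not. Unlike the algorithms of this appendix, the sketch reports each visible tile once rather than once per window of visibility.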

Horizontal-visibility  searching  is  based  on  the  neighbor-find¬ 
ing  algorithm  of  Section  V-5.2.  The  following  algorithm  is  for 
searching  on  the  right  side  of  the  original  tile;  it  can  be  modi¬ 
fied  to  search  on  the  left  side. 

1)  Use  the  neighbor-finding  algorithm  to  enumerate  the  tiles 
that  touch  the  right  side  of  the  starting  tile.  For  each  tile 
found,  execute  step  2)  or  step  3),  depending  on  the  tile’s  type. 

2)  If  the  neighbor  is  solid,  then  it  is  automatically  visible. 

3)  The  neighbor  is  a  space  tile.  If  it  extends  all  the  way  to 
the  edge  of  the  circuit  (infinity)  ignore  it.  Otherwise,  use  the 
neighbor-finding  algorithm  once  again  to  enumerate  all  the 
tiles  that  touch  its  right  side.  Each  of  these  must  be  a  solid 
tile.  All  of  the  tiles  whose  bottom  edges  are  lower  than  the 
top  edge  of  the  starting  tile  are  visible. 

In this algorithm, a single solid tile may be enumerated several
times, once for each distinct window of visibility with the
starting tile (see, for example, tile D in Fig. 18). The time
required for the horizontal search is linear in the total number
of tile adjacencies in the search area, which was shown in
Appendix A to be linear in the number of tiles. Since there
must be at least one solid tile enumerated for every space
tile enumerated, the expected running time of the search is
linear in the number of solid tiles found. For tiles in a relatively
uniform distribution, the number of visible neighbors

Fig. 19. A pathological case for horizontal visibility searching. When
searching for tiles visible to the right of X, all of the tiles above A
will have to be passed through and skipped over.


Fig.  20.  In  a  downward  visibility  search  from  X,  columns  of  space 
tiles  are  traversed  until  solid  tiles  are  found  or  the  end  of  the  circuit 
is  reached.  In  this  case,  the  numbers  give  the  order  in  which  the 
tiles  will  be  traversed  (the  numbering  ignores  realignments  that  must 
occur  when  advancing  down  the  side  of  a  column).  Tiles  that  fall 
under  more  than  one  column  are  traversed  more  than  one  time. 

of  a  given  tile  is  small  and  independent  of  the  size  of  the  cir¬ 
cuit.  However,  if  a  space  tile  found  in  step  3)  extends  above 
the  starting  tile,  as  in  Fig.  19,  it  could  have  any  number  of 
out-of-range solid tiles along its right edge; since these have
to  be  skipped  over,  the  upper  limit  on  the  running  time  for 
the  algorithm  is  the  total  number  of  solid  tiles  in  the  circuit. 

For  vertical  visibility  searches,  the  algorithm  is  an  extension 
of the area-search algorithm of Section V-5.3. It consists of a
recursive  set  of  searches  of  successively  thinner  columns.  The 
following  algorithm  will  find  all  the  solid  tiles  visible  below 
the  starting  tile;  it  can  be  modified  to  find  all  those  above  the 
starting  tile.  See  Fig.  20  for  an  example. 

1) The initial column being searched extends downward
from  the  bottom  of  the  starting  tile.  Use  the  approach  of  the 
area-searching  algorithm  to  advance  one  by  one  through  the 
tiles  lying  under  the  left  edge  of  this  column. 

2)  When  a  space  tile  is  found  in  step  1),  check  to  see  if  it 
extends  across  the  whole  column.  If  so,  then  advance  down¬ 
wards  to  the  next  tile  (this  is  the  case  for  tiles  1  and  2  in  Fig. 


Fig. 21. Tile 5 causes severe misalignment during downward visibility
searches from X. Each of tiles E-I will have to be traversed for each
column between tiles A-D.

20).  If  the  space  tile  extends  downward  to  -infinity,  then 
return. 

3) If a solid tile is found, or if the space tile does not extend
across the entire column, then do not continue down any further.
Instead, scan across the top of the column (following tr
stitches from the tile found in step 1)). Each of the solid tiles
found in this way is visible to the starting tile. For each space
tile found in this scan, invoke a recursive search on the column
underneath this space tile (tiles 3 and 12 in Fig. 20 are examples
of space tiles that start new column searches).

The  algorithm  terminates  when  all  of  the  columns  have  been 
closed  off  by  continuous  solid  tiles  across  the  columns  or 
when  the  end  of  the  circuit  (infinity)  is  reached.  As  with  the 
horizontal  search,  tiles  are  enumerated  once  for  each  window 
of  visibility  with  the  starting  tile.  Since  each  of  the  visible 
tiles is visited once for each window of visibility, the expected
running time is linear in the number of visible tiles (for relatively
uniform tile distributions). However, the same misalignment
that was illustrated in Fig. 9 for area searching can occur
here, as shown in Fig. 21. In the unlikely event that most of
the tiles in the circuit are piled up like tiles E-I in Fig. 21, they
will  all  have  to  be  traversed  as  part  of  each  column,  and  the 
total  running  time  will  be  proportional  to  the  product  of  total 
circuit  size  and  number  of  visible  tiles. 

Each  of  the  visibility  algorithms  works  in  a  particular  direc¬ 
tion.  The  directed  nature  is  important  to  other  algorithms 
that  use  visibility  searches.  The  right  visibility  search  enumer¬ 
ates  visible  tiles  in  order  from  top  down,  and  the  bottom  visi¬ 
bility  search  enumerates  visible  tiles  in  order  from  left  to  right. 

Appendix  D 

Plowing  in  Linear  Time 

The poor worst-case behavior of the plowing algorithm in
Section 5.7 occurred because the algorithm processed tiles in
a haphazard order. As a result, some tiles could be moved
many times as the algorithm discovered that more and more
space was needed to move other tiles out of the plow area.
The linear-time algorithm makes two passes over the circuit.
In the first pass, it computes how far each tile must move;
tiles are processed in topological order so that a given tile
is not processed until its final position is known. In the
second  pass,  the  tiles  are  actually  moved.  Each  tile  is  moved 
exactly  once. 

The  linear-time  algorithm  requires  an  extra  data  value  to  be 
stored  in  each  tile.  This  additional  value  is  called  the  tile’s 
delta, and gives the distance that the tile must be moved. Initially,
all of the deltas are zero; when the plowing algorithm
is  finished,  it  leaves  all  the  deltas  zero  for  future  plowing.  The 
linear-time  algorithm  also  requires  the  use  of  a  linked  list  of 
tiles  to  be  moved.  Tiles  are  added  to  the  linked  list  in  the  first 
pass;  in  the  second  pass,  tiles  are  moved  in  list  order. 

Pass  1  consists  of  setting  the  delta  of  the  initial  tile  to  its 
plow  distance  and  calling  the  following  recursive  procedure  to 
process  the  tile.  The  basic  algorithm  is  independent  of  the 
plow  direction. 

1)  Add  a  pointer  to  the  current  tile  onto  the  front  of  the 
list  of  tiles  to  be  moved.  This  tile  will  be  moved  before  all 
previously  encountered  tiles. 

2)  Use  the  tile's  delta  and  location  to  compute  the  area 
that  this  tile  will  plow  out  as  it  moves. 

3)  Use  the  visibility  search  from  Appendix  C  to  enumerate 
all visible solid tiles in the plow area. Execute steps 4) and 5)
for  each  neighbor  tile  found  in  this  way. 

4)  Compute  the  delta  required  to  move  the  tile  out  of  the 
plow  area.  If  this  delta  is  greater  than  the  tile's  current  delta, 
then  update  the  tile’s  current  delta. 

5)  If  this  is  the  last  time  we  will  ever  see  the  neighbor,  then 
call  this  procedure  recursively  to  process  the  neighbor.  The 
determination  of  “last  time”  depends  on  the  directed  nature 
of  the  algorithms  for  visibility  searching.  For  example,  in  a 
left-to-right  compaction,  the  “last  time”  is  when  the  neigh¬ 
bor’s  bottom  edge  is  in  the  window  of  visibility,  or  when  the 
bottom  edge  of  the  overall  plow  area  is  in  the  window  of  visi¬ 
bility.  The  bottom  edge  of  the  overall  plow  area  must  be 
passed  down  in  the  recursive  calls;  it  is  the  lowest  bottom  edge 
for  any  plow  on  the  recursive  stack. 

Pass  2  scans  the  list  in  order  from  front  to  back.  Each  tile 
on  the  list  is  erased,  then  recreated  at  a  new  position  deter¬ 
mined  by  its  delta.  The  ordering  of  the  list  guarantees  that 
the  final  position  of  each  tile  is  empty  at  the  time  it  is  moved. 
When  moving  the  tiles,  the  deltas  are  zeroed  out  again  in  prep¬ 
aration  for  the  next  plowing  operation. 
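
The two passes can be illustrated with a deliberately simplified sketch. It assumes a single row of tiles plowed to the right, so the visibility search of Appendix C reduces to finding the nearest tile inside the plow area and every neighbor is seen exactly once. The `Tile` class and all function names are hypothetical and are not taken from the layout editor.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    name: str
    left: int
    right: int
    delta: int = 0  # distance the tile must move; zero between plows


def visible_neighbor(tile, row, new_right):
    """Stand-in for the visibility search: nearest tile inside the plow area."""
    hits = [t for t in row if t is not tile and tile.right <= t.left < new_right]
    return min(hits, key=lambda t: t.left, default=None)


def plow_pass1(tile, row, move_list):
    """Pass 1: record the tile, then push its delta onto the tile it would hit."""
    move_list.insert(0, tile)                 # step 1: front of the list
    new_right = tile.right + tile.delta       # step 2: area plowed out by the move
    neighbor = visible_neighbor(tile, row, new_right)  # step 3
    if neighbor is not None:
        needed = new_right - neighbor.left    # step 4: delta that clears the area
        if needed > neighbor.delta:
            neighbor.delta = needed
        # Step 5: in one dimension a neighbor is only ever seen once,
        # so this is always the "last time" and we recurse immediately.
        plow_pass1(neighbor, row, move_list)


def plow_pass2(move_list):
    """Pass 2: move tiles in list order and zero the deltas for the next plow."""
    for tile in move_list:
        tile.left += tile.delta
        tile.right += tile.delta
        tile.delta = 0


def plow(row, start, distance):
    """Plow `start` to the right by `distance`; return names in move order."""
    start.delta = distance
    move_list = []
    plow_pass1(start, row, move_list)
    plow_pass2(move_list)
    return [t.name for t in move_list]
```

Because each tile is pushed onto the front of the list, the tile farthest along the plow direction is moved first, which is why its destination is empty when it is moved.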

If the total number of tiles moved is M and the total number
of tiles in the circuit is N, then each of the two passes has an
expected running time that is of order M, with worst-case running
time proportional to MN. In pass 1, the recursive procedure
is invoked exactly M times (once for each tile that must
be moved). The overall running time for pass 1 is determined
by the time spent in enumerating all the visible neighbors for
all the tiles that are moved. In the average case, each tile's
visible neighbors can be found in constant time, so the total
running time is proportional to M. In the worst case, the cost
of the visibility searches may be MN, so the worst-case running
time of pass 1 is of order MN.

For pass 2, the expected time to delete or create each tile
is constant, so the expected running time is linear in M. However,
the worst-case deletion or creation time for a tile is
proportional to the overall circuit size, so the worst-case running
time for pass 2 is of order MN.

Acknowledgment 

Michael Arnold, Carlo Sequin, David Ungar, and David
Wallace all took part in the discussions that led to the formulation
of corner stitching. C. Sequin developed the proof
that 3N + 1 space tiles are always sufficient in a design with
N solid tiles. Gordon Hamachi, Bob Mayo, Walter Scott, and
George Taylor implemented the layout editor based on corner
stitching. In addition to these people, Leo Guibas, David
Patterson, Alberto Sangiovanni-Vincentelli, and the referees
all provided helpful comments on drafts of this paper.

References 

[1] M. H. Arnold and J. K. Ousterhout, “Lyra: A new approach to
geometric layout rule checking,” in Proc. 19th Design Automation
Conf., pp. 530-536, 1982.

[2] J. L. Bentley and J. H. Friedman, “A survey of algorithms and
data structures for range searching,” ACM Computing Surveys,
vol. 11, no. 4, 1979.

[3] M. Y. Hsueh, “Symbolic layout and compaction of integrated
circuits,” University of California, Berkeley, Tech. Rep. UCB/ERL
M79/80, Dec. 1979.

[4] G. Kedem, “The quad-CIF tree: A data structure for hierarchical
on-line algorithms,” in Proc. 19th Design Automation Conf., pp.
352-357, 1982.

[5] K. H. Keller and A. R. Newton, “KIC2: A low cost, interactive
editor for integrated circuit design,” in Dig. Papers for COMPCON
Spring 1982, pp. 305-306.

[6] J. K. Ousterhout, “Caesar: An interactive editor for VLSI,” VLSI
Design, vol. II, no. 4, pp. 34-38, fourth quarter 1981.

[7] J. K. Ousterhout and D. M. Ungar, “Measurements of a VLSI
design,” in Proc. 19th Design Automation Conf., pp. 903-908, 1982.

[8] A. Sangiovanni-Vincentelli, private communication.


John K. Ousterhout received the B.S. degree in
physics from Yale College, New Haven, CT, in
1975 and the Ph.D. degree in computer science
from Carnegie-Mellon University, Pittsburgh,
PA, in 1980.

Since 1980 he has been an Assistant Professor
of Electrical Engineering and Computer Sciences
at the Berkeley campus of the University
of California. His research interests include
computer-aided design, VLSI architecture, and
operating systems.


The  following  section  contains  papers  and  reports  relating  to  research  in 
computer  Circuit  &  System  Design.  They  describe  work  which  was  wholly  or 
in  part  performed  under  the  sponsorship  of  the  DARPA  grant. 


(1) R. Kavaler, T. Noll, H. Murveit, M. Lowy and R.W. Brodersen, "A Dynamic Time
Warp IC for a 1000 Word Recognition System," Proc. of ICASSP, San Diego,
March, 1984.

(2) P. Ruetz, S.P. Pope, B. Solberg and R.W. Brodersen, "Computer Generation
of Digital Filter Banks," ISSCC Digest of Technical Papers, Feb. 1984.

(3) S.P. Pope, B. Solberg and R.W. Brodersen, "A Single-Chip LPC Vocoder," Tech.
Digest of the ISSCC, Feb. 1984.

(4) C.C. Hsiao and R.W. Brodersen, "A Multirate Root LPC Synthesizer,"
Proceedings of ICASSP, March, 1984.

A DYNAMIC TIME WARP IC FOR A

ONE THOUSAND WORD RECOGNITION SYSTEM


Robert Kavaler, R.W. Brodersen
University of California, Berkeley CA 94720

Tobias G. Noll
Siemens AG, Munich Germany

Menachem Lowy
GE, Schenectady NY

Hy Murveit
SRI International, Menlo Park CA 94025


ABSTRACT 

Dynamic time warping is considered a superior way
to perform time alignment in speech recognition. [1]
Unfortunately, dynamic programming algorithms require
too much computation for conventional computer architectures
to handle and still provide good response time
with 1000 reference words. This paper presents a single
chip that is capable of performing the dynamic time
warp processing necessary for recognizing 1000 words in
real-time.


INTRODUCTION 

MOS-LSI technology has made it possible to design
circuits capable of processing a large number of complex
operations on a single chip. This technology, when
applied to the speech recognition task, allows one to
design a chip capable of performing all the computations
necessary to recognize words from a dictionary of 1000
words in real time using a dynamic time warp algorithm.
In  this  case  "real  time"  means  that  there  is  a  very  short 
(less  than  20ms)  delay  from  the  detected  end  of  the  spo¬ 
ken  word  to  the  time  that  the  recognition  decision  is 
made.  The  algorithm  that  was  implemented  is  general 
enough  so  that  connected  speech  can  be  recognized 
without  any  speed  penalty.  Also,  a  character  recognition 
project  uses  this  exact  same  chip  for  pattern  matching. 

Our chip does not use a general purpose architecture.
Instead we used a very parallel and pipelined architecture
and thus can run much faster than other chips
that perform similar functions.

Before going any further we should define two terms
used in this paper: template and frame. A template is a
pattern that represents a typical way that a word or a
phrase might be spoken. Templates are ordered
sequences  of  frames,  where  each  frame  represents 
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short-term  spectral  information. 


THE  TIME  WARP  ALGORITHM 

Time alignment using a dynamic programming algorithm
is a very popular means of providing high accuracy
in many current speech recognition systems. Unfortunately,
these algorithms run very slowly when implemented
on a standard computer architecture. [2,3,4] If
no constraints are placed on the system then each frame
of an incoming unknown utterance must be compared to
each frame of each template in the dictionary. For each
frame-to-frame comparison, one must compute a
Euclidean distance between two N-element vectors, find
the minimum among 3 numbers, and add that minimum
to the Euclidean distance. This process takes many
instructions on most computers, so pruning techniques
were used to eliminate the need to compute all possible
frame-to-frame comparisons.

Our chip (the time warp chip) does not prune paths
as a method of eliminating frame-to-frame comparisons.
Instead, we have built a chip that has the power to compute
all of these comparisons very quickly using a pipelined
and parallel architecture. The additional circuitry
needed to perform the pruning would have been much too
expensive. The comparisons are computed on a column
by column basis, thus allowing one to start processing an
unknown utterance before it is spoken completely.
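
The per-comparison work described above (a distance between two frames, a three-way minimum, and an add), evaluated column by column without pruning, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the chip's implementation: the squared form of the distance, the choice of the three predecessor cells, the treatment of the boundary values, and all names are assumptions made for illustration.

```python
import math


def frame_distance(u, t):
    """Squared Euclidean distance between two equal-length frames."""
    return sum((a - b) ** 2 for a, b in zip(u, t))


def dtw_score(unknown, template, bottom=None):
    """Accumulated distance between an unknown word and one template.

    Columns follow the unknown utterance, so a column can be computed as
    soon as its frame arrives. `bottom[j]` is the boundary value below the
    first template frame in column j (assumed: 0 for the first column and
    infinity afterwards, which suits isolated words).
    """
    if bottom is None:
        bottom = [0] + [math.inf] * (len(unknown) - 1)
    prev_col, prev_bottom = None, math.inf
    for j, u_frame in enumerate(unknown):
        col = []
        for k, t_frame in enumerate(template):
            d = frame_distance(u_frame, t_frame)
            # Assumed predecessors: same column one frame down, previous
            # column same frame, previous column one frame down.
            below = col[k - 1] if k > 0 else bottom[j]
            left = prev_col[k] if prev_col is not None else math.inf
            if prev_col is None:
                diag = math.inf
            elif k == 0:
                diag = prev_bottom
            else:
                diag = prev_col[k - 1]
            col.append(d + min(below, left, diag))
        prev_col, prev_bottom = col, bottom[j]
    return prev_col[-1]
```

A stretched copy of a template still scores zero, while an unrelated template scores higher; that is the property time alignment is meant to provide.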

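The algorithm implemented is similar to those
presented by Sakoe and Chiba. [5] The distance between
two words is defined as $D_{J,K}$, the final value of the recurrence

$$d_{j,k} = \sum_{i=0}^{N-1} (u_{i,j} - t_{i,k})^2$$

$$D_{j,k} = d_{j,k} + \min(D_{j-1,k},\ D_{j-1,k-1},\ D_{j,k-1})$$

where $u_{i,j}$ is element $i$ of frame $j$ of the unknown word and
$t_{i,k}$ is element $i$ of frame $k$ of the template word.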

The  boundary  values  of  D  are  fixed  to  be  infinite  along  the 
template  word  axis  and  programmable  along  the  unknown 
word  axis.  This  allows  a  connected-word  algorithm  to  be 
implemented without additional hardware. [6] No pruning
or global slope constraints are used, thus each frame of
the unknown word must be compared to each frame in
the template word memory. Local slope constraints can
be applied as a programmable option.

Figure 1: Computation (axis label: frames 1, 2, 3, 4, ... of the unknown utterance)


SYSTEM ARCHITECTURE

The time warp chip is the processing element in a
more complex system. An entire speech recognition system
is being designed on a single Multi-bus board. The
system consists of the time warp chip, a digital filter
bank chip, [7] a general-purpose processor (used for
higher level processing, or as a general-purpose computer),
and enough memory to store over 1000 templates.
The time warp chip interfaces to the general purpose
processor through a dual-ported memory, a dma-driven
parallel port, and a general parallel port. The dual
ported memory contains the templates, the boundary
value for D (bottom score), and the unknown word frame.
The bottom score and unknown word frame are updated
before each column is computed. The dma-driven port
allows the general purpose processor to catch the final
scores between each template and the unknown word.
The other port tells the chip to start a new column.

For 1000 words one needs enough memory for 25000
frames (25 frames per word). We have set aside enough
space for 32000 frames (using 32 standard HM4864P-2
64Kx1 dynamic RAMs). In addition to the template
memory a scratch-pad memory is needed for the
dynamic programming calculations. This memory is
made with 12 16Kx4 chips (TMS4416-15). The general-purpose
processor has its own memory (128K bytes),
allowing it to operate independently of the time warp chip.

Figure 2: System Block Diagram


CHIP OPERATION
The time warp chip has 68 pins:

3 - Power, Ground, Substrate
16 - Input data from template memory
24 - I/O data for scratch pad memory
18 - Address lines (shared) for
     template and scratch pad memories
3 - Control line inputs:
    Start First Column (SCOL)
    Start Other Columns (SREC)
    Clock
4 - Control line outputs:
    End of Column (EOC)
    End of Template (EOT)
    RAS for scratch pad address
    Read/Write from scratch pad memory

The chip starts up in an idle mode, where it performs
sequential reads from template and scratch pad
memories to refresh dynamic memories. When either a
SCOL or SREC signal is received the address counter goes
to 0 and reads the first word of template memory into the
bottom score register. The next 3 words are ignored.
Then one frame (4 words) of the unknown utterance is
read into an internal memory. Next the chip performs
the time warp algorithm with the remaining templates.
The end of a template is indicated by the special code
FFFF hex as its first word. If the second word is FFFF
then the end of column has been reached. At the end of
column and the end of template the appropriate outputs
are strobed to allow scores to be retrieved by the general
purpose processor. At the end of a column the chip
returns to the idle mode.
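
The way one column's worth of template memory is walked can be sketched as a small parser. This is an illustrative reading of the description above, not the chip's logic: the function name is hypothetical, and it is an assumption that a marker frame with FFFF in both of its first two words closes the last template as well as the column.

```python
END = 0xFFFF
WORDS_PER_FRAME = 4


def parse_column_memory(words):
    """Split a template-memory image into bottom score, unknown frame, templates."""
    frames = [words[i:i + WORDS_PER_FRAME]
              for i in range(0, len(words), WORDS_PER_FRAME)]
    bottom_score = frames[0][0]      # first word; the next three are ignored
    unknown_frame = frames[1]        # one 4-word frame of the unknown utterance
    templates, current = [], []
    for frame in frames[2:]:
        if frame[0] == END:          # end-of-template marker (EOT strobe)
            if current:
                templates.append(current)
            current = []
            if frame[1] == END:      # end-of-column marker (EOC strobe)
                break
        else:
            current.append(frame)
    return bottom_score, unknown_frame, templates
```

For example, an image holding two frames of one template, a marker, one frame of a second template, and an end-of-column marker yields two templates; any words after the end-of-column marker are never read.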


[Figure 3 layout, from low to high addresses: bottom score;
unknown word frame; frame 1 of template 1; frame 2 of
template 1; ... frame 7 of template 1; end-of-template marker
FFFF 0000 0000 0000; frame 1 of template 2; ...]

Figure 3: Memory Organization


CHIP  ARCHITECTURE 
The  chip  consists  of  5  functional  units: 

1)  A  distance  processor  that  can  compute  a  4- 
dimensional  Euclidean  distance  every  clock  cycle. 

2) A pipeline accumulator that sums four 4-dimensional
Euclidean distances into one 16-dimensional distance.

3)  A  dynamic  programming  processor  that  can  compute 
one  minimization  and  sum  every  4  clock  cycles. 

4)  An  addressing  unit  for  the  external  template  and 
scratch-pad  memories. 

5)  A  controller  for  each  of  the  above  processors. 

Each processor has a custom designed architecture.
The controller is a combination of custom designs and
standard PLA finite state machines.

The distance processor has a four level pipeline.
First, four 4-bit differences and absolute values are computed
in parallel. Second, these differences are squared
(using a PLA), resulting in four 8-bit values. Third, these
8-bit values are summed pairwise into two 9-bit values.
Finally a 10-bit sum is computed. This value is saturated
to 8 bits.

Four of these distances are then accumulated into
an 8-bit register. The resulting 8-bit sum is then sent to
the dynamic programming processor.

The dynamic programming processor has three 16-bit
registers corresponding to all possible paths into a
given element of the DP matrix. One of the 16 bits is used
for slope constraints. Two of these registers are fed from
memory, and one is an accumulator. To process a node
(frame) one must first compute the minimum of these
three registers, then sum it with the distance above.
Three comparators and a PLA are used to compute the
minimum of the registers. The PLA contains the rules for
handling the first row and first column of the DP matrix.
The summing output is saturated to 13 bits to prevent
overflow.

The dynamic programming processor also
computes the projection of the path on the unknown
utterance axis. This is needed in connected speech applications.
The projection is computed with an 8-bit
saturating counter.

Figure 4: Chip Block Diagram
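
The arithmetic of the distance pipeline, the accumulator, and the minimize-and-add step can be modeled in a few lines of Python. This is a behavioral sketch only: the function names are hypothetical, the bit widths are taken as 4-bit inputs with an 8-bit saturated distance, and it is an assumption that the accumulator also saturates at 8 bits.

```python
def saturate(value, bits):
    """Clamp a non-negative value to the largest number that fits in `bits`."""
    return min(value, (1 << bits) - 1)


def distance4(u, t):
    """Four pipeline stages: |difference|, square, pairwise sums, saturated sum."""
    diffs = [abs(a - b) for a, b in zip(u, t)]        # four 4-bit magnitudes
    squares = [d * d for d in diffs]                  # each at most 225
    pairs = [squares[0] + squares[1], squares[2] + squares[3]]
    return saturate(pairs[0] + pairs[1], 8)           # final sum clamped to 8 bits


def frame_distance16(u, t):
    """Accumulate four 4-element distances into one 16-element distance."""
    total = 0
    for i in range(0, 16, 4):
        total = saturate(total + distance4(u[i:i + 4], t[i:i + 4]), 8)
    return total


def dp_step(left, diag, below, distance, bits=13):
    """Minimum of the three incoming path scores plus the local distance."""
    return saturate(min(left, diag, below) + distance, bits)
```

Saturation keeps a badly mismatched frame from wrapping around to a small value, which would otherwise make a poor path look attractive.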

Due to bandwidth considerations, the address
counter must count up 2 then down 1. The new DP sums
are written after the decrement; the old DP sums are
read after the increment. This sequence requires a special
counter.
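
The up-2, down-1 address pattern can be made concrete with a short generator. This is only a sketch of the sequence as described; the starting address and the names are assumptions.

```python
def address_sequence(start, nodes):
    """Yield (address, operation) pairs for an up-2, down-1 address counter."""
    addr = start
    for _ in range(nodes):
        addr += 2
        yield addr, "read old"    # old DP sum is read after the increment
        addr -= 1
        yield addr, "write new"   # new DP sum is written after the decrement
```

The net advance is one address per node, and each location is read one step before it is overwritten, so old and new sums can share the same scratch-pad memory.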

The system is controlled with a standard finite state
machine. There is also a circuit that computes various
pipeline timing signals, and a high speed counter for
internal synchronization.

The chip is implemented in a 4 micron NMOS process,
has an active area of 20,000 square mils, and runs
with a 3kHz clock.


CONCLUSIONS 



An NMOS LSI chip was designed that has enough processing
power to perform the dynamic programming
algorithm used to recognize 1000 words in real time. This
chip can be used in either isolated word or connected
word applications. The speed of the chip was accomplished
by using a highly parallel and pipelined custom design.
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ABSTRACT 

In  order  to  reduce  the  design  time  of  digital  filter  bank  cir¬ 
cuits,  a  design  system  has  been  developed.  The  software  consists 
of  the  filter  compiler  which  converts  high  level  filter  descriptions 
to  hardware  descriptions  and  the  layout  generator  which  converts 
the  hardware  descriptions  to  a  layout  file.  To  verify  the  algorithms 
before  fabrication,  a  test  system  is  employed.  The  development 
time  of  this  system  was  kept  to  a  minimum  by  designing  the 
hardware  to  be  easily  micro  coded  and  assembled.  Several  circuits 
have been fabricated and tested that were generated with this system,
including a single band pass filter chip, a 112 pole 16 channel
filter bank for a speech recognition system and a 16 channel spectrum
analyzer for consumer stereo applications. The speech
recognition chip achieved a SNR of 80 dB with an area of 25 sq mm
in a 4 micron NMOS technology.
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]  INTRODUCTION 

Fully  automated  design  of  complex  integrated  circuits  has  often  resulted  in 
limited  usefulness  because  of  poor  performance  or  inefficient  silicon  space  utili¬ 
zation.  If  few  restrictions  are  placed  on  the  function  of  the  ICs  to  be  generated, 
then  the  optimization  problem  becomes  difficult,  yielding  circuits  far  inferior  to 
custom  designs.  Another  important  aspect  of  using  automated  design  systems 
is  the  time  required  to  develop  the  software  and  its  reliability.  There  is  no 
advantage  in  reducing  hardware  design  time  if  the  resultant  software  develop¬ 
ment  effort  becomes  equally  time  consuming  and  error  prone. 

It has become apparent that tradeoffs between development time, generality
and final circuit performance must be made. The design system described
here was based on an emphasis on high performance with minimal software
effort. Instead of treating the software and hardware designs as distinct problems,
the hardware architecture and layouts were designed in a way that made
automation simpler while maintaining performance.

Some automated design systems have been developed which allow the user
to interact only at the highest level. If this is incompatible with the requirements
of the user, then the entire design system is of no use. If, however, the
software is designed to allow the user to operate at a lower level, more jobs can
be accomplished. Our design system has been developed in a hierarchical
manner. For those wishing to generate filter banks, the task can be accomplished
from the highest level, i.e. totally automated. The user can also use the
lower levels of the system (i.e. only partially automated) for other applications.
Further, the system can be extended at the highest level for the specific needs
of the user.

The scope of applications that has been chosen is digital filter banks, which
are a parallel and/or cascade connection of filter sections. Digital filter banks
are found in applications as diverse as MODEMs and spectrum analyzers for
speech recognition, channel vocoders, consumer stereo and EEG analysis. Decimation
and rectification are required in addition to digital filtering in the spectrum
analysis applications.

II THE HIERARCHY

Currently, the hierarchy is four levels deep. At the lowest level are the circuit
'cells'. These cells consist of basic building blocks such as counters,
adders, RAM cells, ROM cells, etc. The cells can be used without any automation
for a totally manual design. At the next level, the layout generator assembles
these cells into more complex blocks such as data paths and controllers from
hardware descriptions. This would be useful for users that desire a signal processor,
but need a few additional circuits that have not been designed or are not
handled by the layout generator. The user would only have to specify the
hardware specifications including the RAM size and ROM contents and add the
new circuit blocks to form a completed chip. At the third level, the layout generator
assembles the data path and the controller into a completed chip. For
those that have a non-filter digital signal processing application, a chip could be
generated completely from this hardware description. Finally, at the highest
level, the filter compiler generates the hardware description from a digital filter
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description.  At  this  level,  digital  filter  banks  can  be  generated  completely 
automatically. 

To generate filter bank chips, the design procedure shown in figure 1 is followed.
The digital filter bank structure and coefficients are specified in an input
file. The filter compiler converts the input file to the hardware description. To
check  the  algorithms  before  the  circuit  is  fabricated,  the  hardware  description 
can  be  used  as  an  input  to  a  real-time  tester.  When  the  designer  is  satisfied,  the 
layout  generator  is  used  to  create  a  layout  file. 

III THE CELLS

The  basic  architecture  of  the  hardware  is  shown  in  figure  2.  There  are  two 
main blocks: the controller consisting of the program counter, the ROM and the
address  index  register  and  the  data  path  consisting  of  the  ALU  and  RAM. 

There  are  several  reasons  for  having  few  large  circuit  blocks.  The  block 
division  was  chosen  to  minimize  assembly  difficulty  while  retaining  adequate 
generality.  With  few  blocks,  the  automatic  assembly  is  simplified.  Routing 
difficulty  is  reduced  by  having  fewer  blocks  that  need  to  be  routed  together. 
The  blocks  are  also  made  up  primarily  of  abutting  circuit  cells  which  are  very 
simple  to  assemble. 

The  large  blocks  were  chosen  to  be  functionally  complete.  That  is,  the 
blocks  can  be  easily  used  to  perform  some  complex  function.  The  blocks  would 
be  complicated  to  use,  except  at  the  lowest  level  of  cells,  in  a  partially  assem¬ 
bled  form.  The  program  counter  may  be  useful  without  the  ROM  but  it  is  easily 
assembled  from  counter  cells  so  that  it  need  not  be  a  separate  block. 

There is also a somewhat natural division. As all cells in the data path could
be designed with the same bit slice pitch, the data path could be made a single
block requiring no data bus routing. The ROM was designed to minimize area,
and that determined the pitch of ROM cells. The ROM cell pitch is, however, vastly
different from the pitch of the control lines entering the data path. That makes
it more efficient to optimize each as separate blocks with routing between the
two than to stretch the ROM to the pitch of datapath control lines.


III-1 The Controller

The  controller  was  designed  to  be  small  with  high  performance.  To  achieve 
these  goals  it  was  made  very  simple  with  a  minimum  number  of  features.  For 
example,  there  is  no  branching  capability  or  micro  coded  instructions.  Adding 
complexity  can  result  in  vastly  increased  area  as  the  extra  registers  and  routing 
are  a  significant  fraction  of  the  controller.  ROM  bits  are  very  small,  regularly 
spaced  and  hence  very  efficient.  Instead  of  putting  the  convenience  of  micro 
coded  instructions  in  hardware,  it  is  put  in  the  software  (at  the  highest  level) 
where  it  does  not  add  to  the  silicon  area. 

Every  cycle  the  controller  outputs  a  valid  horizontal  control  word.  This  hor¬ 
izontal  control  word  specifies  the  value  of  every  data  path  control  line.  Each 
controller  output  comes  directly  from  the  ROM  with  the  exception  of  some  of  the 
RAM  address  lines  when  decimation  is  used.  With  decimation,  the  index  register 
modifies  the  RAM  address.  Although  this  increases  controller  complexity,  it 
saves  ROM  space  and  averts  the  need  to  perform  address  computations  in  the 
data  path.  The  data  path  is  never  used  for  any  control  operation,  allowing  con¬ 
tinuous  signal  processing.  The  circuit  is  also  more  compact  since  busses  are  not 
needed  to  connect  the  control  and  data  path. 


III-2 The Data Path

Figure  3  is  a  block  diagram  of  the  data  path.  As  in  most  signal  processors 
there  is  a  RAM,  adder,  accumulator,  some  form  of  negation/absolute  value  logic 
and  i/o.  However,  no  array  multiplier  is  included. 
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Again,  only  a  minimum  of  features  are  provided.  In  this  way  the  size  can  be 
kept small, making room for additional data paths on a single chip for greater
throughput. The cell circuit design problem is also reduced, while programming
the  data  path  is  more  complicated.  This  is  not  a  problem  when  automation  is 
used,  as  the  filter  compiler  generates  and  optimizes  the  micro  code. 

Since there is no array multiplier, fixed coefficient multiplies are implemented
in a serial-parallel manner [1]. This is accomplished with the use of the
barrel shifter, adder and accumulator. Since a restriction of fixed coefficient
multiplies is placed on the system, less than N cycles are required for an MxN
multiply, where M is the signal width and N is the multiply coefficient length, by
programming the ROM properly. Because a barrel shifter can shift several (0 to
5 in this case) places in a single cycle, multiplies require only a number of
cycles equal to the number of '1's in the coefficient.

By using coefficients encoded in canonical signed digit format [2], it is possible
to save more cycles in a serial-parallel multiply when there are more than
two consecutive ones in the coefficient. This arises because in hardware it is
just as easy to subtract as it is to add a partial product. For example:

to perform: (0.01111)Yn
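
As an illustration of the saving, the following Python sketch recodes a coefficient into canonical signed digits and multiplies with one add or subtract per nonzero digit. It is a sketch under stated assumptions rather than the filter compiler's code: the function names are hypothetical, the coefficient is handled as a scaled integer (0.01111 binary is 15/32, so the integer is 15), and left shifts stand in for the barrel shifter.

```python
def to_csd(n):
    """Canonical signed digits (-1, 0, +1) of a non-negative integer, LSB first."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)   # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def multiply_csd(x, coeff):
    """Return (x * coeff, cycles) using one add/subtract per nonzero digit."""
    acc, cycles = 0, 0
    for position, digit in enumerate(to_csd(coeff)):
        if digit:
            acc += digit * (x << position)   # shifted partial product
            cycles += 1
    return acc, cycles
```

For the coefficient 0.01111, the plain binary form has four ones, while the signed-digit form is 2^-1 minus 2^-5, so the multiply takes two cycles instead of four.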


ripple  carry  adder  is  also  particularly  well  suited  for  a  bit  slice  design  which 
makes  the  automatic  layout  very  straight  forward.  The  output  of  the  adder 
saturates  instead  of  simply  overflowing  to  prevent  limit  cycles.  This  is  easily 
incorporated  in  the  hardware  but  would  require  several  cycles  per  computation 
to  implement  in  software. 

Pipelining in the data path increases the performance of the circuit by
making higher clock rates feasible. The pipeline registers are at the output of
the RAM, the input of the RAM and at the output of the adder (the accumulator).
With pipelining, the RAM and the barrel shifter/adder combination both get a full
cycle for operation. Although pipelining makes micro coding more difficult, it is
transparent to the user when the filter compiler is used.

The  memory  input  register  is  the  only  register  of  the  three  which  can  be 
selectively  written.  In  some  cases,  the  result  of  a  computation  can  be  held  in 
this  register  until  the  RAM  is  inactive  during  a  serial-parallel  multiply.  At  this 
point,  the  result  can  be  written  into  the  RAM  without  requiring  an  extra  cycle. 
Proper  use  of  this  register  reduces  the  length  of  the  micro  code  by  preventing 
the  data  path  from  becoming  memory  bound. 

The  RAM  is  a  four  transistor  dynamic  type  with  a  schematic  shown  in  figure 
4.  A  dynamic  memory  was  chosen  over  static  designs  because  the  dynamic  RAM 
is  smaller  with  lower  power  consumption.  The  RAM  is  automatically  refreshed  as 
long  as  the  sample  rate  is  kept  over  1  KHz  because  every  location  is  both  written 
and  read  each  sample. 

Three possible choices for the RAM design were the one, three or four
transistor cells. The four transistor cell was chosen over one transistor designs
to minimize process sensitivity. To avoid running busses between the RAM and
ALU, it was desired to have the same pitch for both so they could be attached
directly, simplifying automatic layout. To use space efficiently, this required
that the RAM have a single column decode as the optimized cell pitch was
approximately half that of the ALU bit slice. The three transistor design is more
difficult to column decode so the four transistor design was chosen.

IV  LAYOUT  GENERATOR 

The  layout  generator  assembles  the  cells  into  a  data  path  and  a  controller 
block  from  hardware  descriptions.  If  desired  these  blocks  are  then  assembled 
into  a  complete  chip.  The  hardware  is  described  by  several  parameters  includ¬ 
ing:  data  path  word  width,  RAM  size,  decimation  ratio  and  ROM  contents. 

IV-1 Layout Generation Issues

Before  starting  development  of  the  layout  generator,  several  aspects  of 
automated  layout  were  identified  as  difficult  problems.  General  placement  of 
the major circuit blocks requires sophisticated optimization algorithms to generate
space efficient designs. A two level router would be needed to route
between these blocks. An extensive data base would be needed to store the
necessary  data  for  the  router  and  placer.  The  data  base  would  contain  the  ter¬ 
minal  locations  on  each  block  and  the  available  routing  area. 

Other  aspects  of  the  automated  layout  were  found  to  be  easily  bandied 
problems.  It  is  not  difficult  to  assemble  blocks  (ie  the  ROM,  ALU  and  RAM)  from 
abutting  cells  since  the  relationships  involved  are  all  well  determined  by  the 
hardware  parameters  specified  by  the  user  and  the  cell  characteristics.  The 
way  the  cells  go  together  is  determined  by  the  cell  designer  so  that  proper  cell 
design  can  help  the  automated  layout.  For  example,  by  including  the  signal 
routing  within  the  cells,  the  need  for  inter-block  routing  by  the  program  is 
avoided. Fixed routing, where the routing terminals have a constant relationship
throughout all changes in hardware parameters, can be accomplished by
inserting a cell with the appropriate wires in it. That is, no algorithm for routing
is required at all. Regular routing, where the routing terminals are evenly
spaced throughout changes in the hardware parameters, is implemented by a
simple program loop.
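
As an illustration of how little logic regular routing needs, the loop below is a minimal sketch in Python, not the program described in this report. The function name, the coordinate units and the choice of straight vertical wires are assumptions made for the example.

```python
def regular_routing(n_terminals, pitch, x0=0, y_from=0, y_to=100):
    """Return one straight vertical wire per terminal at a fixed pitch.

    Each wire is a pair of (x, y) end points. All names and units here
    are hypothetical; only the "evenly spaced terminals" idea is taken
    from the text.
    """
    wires = []
    for i in range(n_terminals):
        x = x0 + i * pitch          # terminals are evenly spaced
        wires.append(((x, y_from), (x, y_to)))
    return wires


# Changing the word width only changes the loop count, not the algorithm.
wires = regular_routing(n_terminals=4, pitch=12)
```

Because the terminal spacing is constant, a change in a hardware parameter such as the word width changes only the number of loop iterations.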

IV-2 The Floor Plan

In  order  to  avoid  the  more  difficult  problems,  two  major  restrictions  were 
made. The first was to use a fixed floor plan, i.e. a fixed relative placement of circuit
blocks, pads and routing areas on the chip. The floor plan was chosen to reduce
the  complexity  of  the  algorithms  used  and  the  number  of  layout  decisions  that 
must  be  made  by  the  program.  With  the  chosen  floor  plan  all  routing  is  either 
fixed  or  regular. 

The decision was also made to have the program 'know all'. That is, all
information regarding the cells and their connection was coded directly into the
algorithms. Using specific information of the application avoids having to solve
the general problem and reduces the software design time. Software reliability
is enhanced when the simplest algorithms are used instead of complex general
algorithms with obscure failure modes. This approach obviously makes the
programs very specific to the particular cells which are used so that changes in the
cells  may  require  changes  in  the  software.  Therefore,  one  should  not  expect  to 
make  major  upgrades  without  significant  software  changes  with  a  system  such 
as  this.  However,  because  the  software  development  time  is  relatively  small, 
new  software  can  be  written  when  significant  changes  are  made. 

IV-3  Examples  of  Generated  Circuits 

The  circuit  remains  easy  to  assemble  over  the  large  changes  in  hardware 
parameters shown in figure 5a-d. The hardware parameters for each are listed in
table 1. From the figure the fixed floor plan can be seen. The controller, data
path, pads and routing areas are always in the same relative position. The I/O
parallel buss at the top of the chip is an example of regular routing. The routing
area  does  not  change  shape  or  relative  position  as  the  parameters  change.  The 
routing  between  the  controller  and  the  data  path  is  a  function  only  of  the  RAM 
size  and  whether  decimation  is  used.  As  there  are  only  a  few  cases,  each  is 
treated  as  fixed  routing  and  a  cell  with  the  appropriate  wires  is  simply  inserted. 
Wiring from the PC to the ROM is handled similarly. The wiring of supplies and
clocks  requires  little  jumping  (except  in  the  fixed  routing  cells)  and  a  minimal 
amount  of  decision  making. 

The  silicon  area  is  also  used  efficiently  over  the  range  of  parameter 
changes.  Virtually  the  only  wasted  area  is  near  the  pads  or  due  to  differences  in 
length of the controller and data path (see figures 5b, 5c). The RAM gets
longer  as  the  number  of  states  in  the  filter  bank  increases.  The  controller 
increases  in  length  with  the  program  length.  Since  adding  states  requires  a 
longer  program  to  process  these  states,  the  ROM  and  RAM  tend  to  get  larger 
together.  In  figures  5a-c  the  ROM  is  not  column  decoded  and  the  waste  area  is 
not  too  large.  In  figure  5d  the  ROM  length  increased  significantly  so  that  a 
column  decoded  ROM  was  used  to  minimize  the  unused  space. 

Some waste of space is allowed if the waste is not large while the savings in
effort  is.  For  example,  when  decimation  is  used,  the  ROM  width  is  constant 
regardless  of  the  RAM  size.  Up  to  3  bits  of  RAM  are  unused  but  the  routing  is 
simplified.  The  data  buss  routing  area  between  the  data  path  and  pads  on  the 
right  side  of  the  chip  is  of  constant  size.  These  simplifications  reduce  the 
number  of  cases  to  be  handled  with  some  space  wasted  for  the  very  small  chips. 
However,  the  designs  would  likely  not  be  used  anyway,  due  to  the  large  overhead 
involved. 

IV-4 Output File Format

The output of the layout generator is a set of KIC format [3] files. This format was
chosen because layout stations are being used which read this format, making
visual checks convenient. The KIC format also supports the hierarchical
organization of the hardware. The CIF format [4] is used for actual fabrication but
does not allow arrays as the KIC format does.

IV-5 Block Assembly

As mentioned previously, the assembly of blocks from cells is a straightforward
task. An output file is written that lists the cells with the appropriate
offsets and orientation. This information is calculated from the hardware
parameters and cell parameters (e.g. size).
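
The calculation of offsets from hardware and cell parameters can be sketched as follows. This is a hypothetical Python illustration, not the authors' code: the `Cell` and `Placement` records, the cell name and its dimensions, and the orientation label `"R0"` are all assumptions, and the output is a plain list rather than an actual KIC file.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    name: str      # hypothetical cell name
    width: int     # cell size in arbitrary layout units
    height: int


class Placement(NamedTuple):
    name: str
    x: int
    y: int
    orientation: str


def stack_bit_slices(cell, word_width, x=0, y=0):
    """Place `word_width` copies of one bit slice so that they abut vertically.

    The offset of each copy follows directly from the cell height (a cell
    parameter) and the word width (a hardware parameter).
    """
    return [
        Placement(cell.name, x, y + i * cell.height, "R0")
        for i in range(word_width)
    ]


alu_slice = Cell("alu_slice", width=200, height=40)
placements = stack_bit_slices(alu_slice, word_width=16)
```

Each `Placement` corresponds to one entry of the kind the output file lists: a cell, its offset and its orientation.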

The controller is a connection of many cells, which makes its manual layout
difficult. Most variations in the controller are functions of the ROM width and
length (found from the binary listing), and the decimation ratio, all of which the
user specifies explicitly. The ROM length determines how many bits will be used
in the PC and decoder and how the decoder is programmed. Since there are
only 5 different PC sizes, each is a cell with appropriate routing wires. The
decimation ratio determines which type of ROM output register will be used and how
the index register itself is configured. If there is no decimation, all output
registers are the same and no index register is used. If there is decimation, the
output registers that feed the index register input are of a different type and an
index register must be included and programmed to decimate properly.

The data path assembly is quite simple due to the bit sliced nature and the
small number of blocks. The entire ALU requires only one line in a KIC file
specifying an array of bit slices. The entire RAM array is similarly specified. The RAM
decoder can be generated in the same way as the ROM decoder with each cell
being described by one line in the KIC file.

V  THE  FILTER  COMPILER 

The  filter  compiler  generates  hardware  descriptions  from  digital  filter 
descriptions.  This  allows  the  automatic  generation  of  filter  banks  with  virtually 
no  knowledge  of  the  final  hardware. 

V-1 Filter Specification

The compiler reads an input file specifying the filter bank organization in
terms of a parallel connection of channels. Each channel is a cascade
connection of sections. Variations on this format that have been found useful
in some applications are allowed. A section can be factored out and used by different
channels. An example of this is the direct form band pass filter. The zeros are
the same for all channels and can be factored out and computed only once.
Figure 6 shows an example of a filter bank organization. In this example there are
16 parallel channels, each consisting of a 4th order BPF section, rectifier, 1 pole
LPF section, decimation by 8 and a 2nd order LPF section.

Each section is a single input, single output structure with delays,
multiplications and additions. Diagrams of some of the currently programmed sections
are  shown  in  figure  7. 

All  multiplies  defined  in  the  sections  use  fixed  coefficients  of  canonical 
signed  digit  format.  Use  of  this  format,  which  was  described  earlier,  optimizes 
the  usage  of  the  adder  by  minimizing  the  number  of  cycles  required  to  perform 
multiplication. 
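
To make the cycle saving concrete, the sketch below converts an integer coefficient to canonical signed digit form, in which every digit is -1, 0 or +1 and no two adjacent digits are nonzero. It is a standard textbook conversion written in Python for illustration; it assumes the coefficient has already been scaled to an integer, and it is not taken from the compiler described here.

```python
def to_csd(n):
    """Return the canonical signed digit form of integer n.

    Digits are listed least significant first and each is -1, 0 or +1.
    """
    digits = []
    while n != 0:
        if n & 1:
            # Choose +1 when n = 1 (mod 4) and -1 when n = 3 (mod 4), so
            # that the next digit is forced to zero.
            d = 2 - (n % 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits


def nonzero_digits(digits):
    """Each nonzero digit costs one shift-and-add (or subtract) cycle."""
    return sum(1 for d in digits if d != 0)


csd_7 = to_csd(7)                     # 7 = 8 - 1
binary_ones_7 = bin(7).count("1")     # 7 = 4 + 2 + 1
```

For the coefficient 7, plain binary needs three additions while the signed digit form 8 - 1 needs two operations, which is the kind of saving the text refers to.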

There are several options allowed in each section in the bank. The user can
full wave rectify the input of any section. This is useful in spectrum analyzer
applications. Decimation is also handled but in a somewhat restricted way. A
number of channels can have their outputs decimated and modified by some
specified filter. The post decimation filter is the same for all channels being
decimated and the decimation ratio is always the same as the number of
channels being decimated. These restrictions were applied simply to reduce the
development time and could be relaxed in future systems. The user can also
specify that the output of any section be sent off chip through the parallel buss
while setting an output strobe. To implement multiple inputs, the input of any
section can be taken from any channel output or any channel input. Being able
to specify a channel output allows the use of a filter by many other channels
(described above) and allows quite arbitrary filter organizations.
Normally, the default (no specification) results in the parallel channels operating on
the same input data.

The format of the input file is tailored to filter banks and was chosen to
simplify the compiler. The format is as follows:

1.  Input  channels  (one  or  more) 

These sections receive data from off chip and may perform
some filtering (e.g. zeros of direct BPF).

2.  Standard  channels  (one  or  more) 

These are just the regular channels, i.e. some cascade
connection  of  sections.  These  channels  will  be  decimated 
if  a  decimation  channel  is  specified. 

3.  Decimation  channel  (optional) 

This  is  the  channel  that  operates  on  the  output  of  all 
standard  sections  above  after  decimation. 

4.  Non-decimated  Standard  channels  (optional) 

More  regular  channels  that  are  not  to  be  decimated. 

The format for the sections is as follows:

1. Section identifier (2 letters), <options, if any>, N coefficients

The format specifies the order in which micro code is generated and stored in the
ROM and hence the order in which it is executed. This saves the compiler from having
to determine this information.


V-2 The Filter Library


The compiler references a filter library which contains pertinent
information about the allowed sections. A file contains a list of valid section identifiers
along with the number of memory locations and coefficients required for each
section.

For  each  section  there  is  also  a  file  containing  the  macro  for  that  section. 
The  macro  file  contains  the  symbolic  micro  code  that  implements  a  section 
without  the  coefficients  or  options  inserted.  Symbolic  micro  code  is  just  a 
description  of  data  path  control  lines  that  have  been  grouped  functionally.  The 
symbolic  micro  code  has  fields  to  describe  the  following: 

memory  operation 

relative  memory  address  (actual  address  computed  by  compiler) 

barrel  shifter  input  mux  selection 

number  of  shifts  (constant  or  taken  from  input  file) 

adder  a  input  mux  select 

adder  b  input  mux  select 

i/o  operation 

Currently,  this  micro  code  must  be  written  by  hand  for  each  section.  This 
involves  a  detailed  knowledge  of  the  timing  and  architecture  that  the  average 
user  would  not  have.  Although  software  could  generate  the  micro  code  from 
difference  equations,  this  was  not  chosen  because  higher  performance  code 
could  be  generated  by  hand.  For  a  second  order  section  the  length  of  the  micro 
code is typically only 8 words.

V-3 Compiler Operation

The operation of the filter compiler is shown in figure 8. On the first pass,
the  input  file  is  checked  for  errors  and  hardware  requirements  such  as  RAM  size 
and  decimation  ratio  are  determined.  On  the  second  pass,  symbolic  micro  code 
for  the  entire  bank  is  generated.  The  symbolic  micro  code  is  then  compressed. 
Finally,  the  symbolic  micro  code  is  assembled  to  binary  micro  code. 


During  the  first  pass,  several  errors  are  checked  for,  the  amount  of  RAM  is 
determined  and  each  state  is  assigned  a  RAM  location.  The  error  check  will 
locate  syntax  errors,  undefined  sections,  filter  library  errors  or  the  use  of  the 
wrong  number  of  coefficients.  If  decimation  is  used,  the  decimation  ratio  is 
determined  by  counting  the  number  of  standard  sections  before  the  decimation 
section.  The  RAM  requirements  can  then  be  determined.  Without  decimation, 
the  amount  of  RAM  required  can  be  found  by  simply  adding  up  the  memory 
requirements for each section. With decimation, things are not as simple. The
index  register  supplies  the  high  order  RAM  address  lines  when  the  decimation  is 
performed  so  that  some  RAM  may  not  be  used. 

Amount of RAM accessed:

    A = (RAM needed for input and standard sections)
        + (RAM needed for decimation channel) × (number of decimated channels)

Amount of RAM included on chip:

    $B = 2^{\mathrm{int}(\log_2(A-1)) + 1}$
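
The two quantities can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names and the example word counts are hypothetical, and the on-chip size is read as 2 raised to int(log2(A-1)) + 1, i.e. the smallest power of two that holds A words.

```python
def ram_accessed(standard_words, decimation_words, n_decimated):
    """A: words used by the input and standard sections, plus one copy of
    the decimation channel's words for every decimated channel."""
    return standard_words + decimation_words * n_decimated


def ram_on_chip(a):
    """B: RAM size rounded up to a power of two.

    For a >= 2, (a - 1).bit_length() equals int(log2(a - 1)) + 1, so the
    result matches the formula without floating point rounding problems.
    For a == 1 the formula is undefined; this sketch returns 1.
    """
    return 1 << (a - 1).bit_length()


# Hypothetical bank: 40 words for standard sections, a decimation channel
# needing 2 words, and 8 decimated channels.
a = ram_accessed(standard_words=40, decimation_words=2, n_decimated=8)
b = ram_on_chip(a)
```

In this made-up case 56 words are accessed and 64 are placed on chip, so 8 words go unused.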

Each  state  is  then  assigned  to  a  RAM  location.  The  states  are  assigned  to 
sequential  RAM  locations  as  they  are  encountered  in  the  input  file  if  there  is  no 
decimation. That is, the first state of the first filter is stored in the first RAM
location  while  the  last  state  of  the  last  section  is  stored  in  the  last  location.  With 
decimation,  the  states  accessed  by  the  decimation  filter  are  assigned  first  at 
fixed  increments.  The  remaining  states  are  then  filled  in  sequentially. 

The  symbolic  micro  code  for  the  entire  bank  is  generated  during  the  second 
pass.  To  accomplish  this  the  input  file  is  scanned  until  a  section  declaration  is 
found.  The  macro  for  that  section  is  read  from  the  library  and  expanded  into 
complete  micro  code  by  inserting  the  coefficients  and  options  from  the  input 
file. This process is repeated until the end of the input file is found.
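
The macro expansion step might look like the following Python sketch. Everything specific in it is hypothetical: the dictionary layout, the field names `mem_op`, `rel_addr` and `shift`, and the one pole example macro are invented for illustration and are not the library format used by the compiler.

```python
def expand_macro(macro, base_address, shifts):
    """Expand one section macro into symbolic micro code words.

    macro        -- list of dicts with keys 'mem_op', 'rel_addr' and 'shift';
                    'shift' is either an int (constant) or a string naming a
                    value to be taken from the input file.
    base_address -- first RAM location assigned to this section's states.
    shifts       -- shift counts from the input file, keyed by name.
    """
    words = []
    for template in macro:
        word = dict(template)      # copy so the library macro stays intact
        # Relative address -> actual address, computed by the compiler.
        word["addr"] = base_address + word.pop("rel_addr")
        if isinstance(word["shift"], str):
            word["shift"] = shifts[word["shift"]]
        words.append(word)
    return words


# Hypothetical two word macro for a one pole section.
ONE_POLE = [
    {"mem_op": "read", "rel_addr": 0, "shift": "c0"},
    {"mem_op": "write", "rel_addr": 0, "shift": 0},
]
code = expand_macro(ONE_POLE, base_address=12, shifts={"c0": 3})
```

Repeating this call for each section declaration in the input file, with the base address taken from the first pass, yields the symbolic code for the whole bank.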

The  symbolic  code  generated  in  the  second  pass  is  compressed  by  looking 
for  sequences  of  code  that  can  be  shortened.  There  are  three  cases  that  are 
optimized. First, and most important, is the performing of the first memory
access during the final computation of the previous section. This appears at
nearly every section boundary and utilizes the pipelining of the data path.
Another case is the utilization of both adder inputs when the accumulator is
empty. Normally the B input is zeroed and data is brought in through the A
input. If a coefficient has certain properties, additional data can be brought in
through the B input, saving one cycle. Finally, one cycle can be saved if data needed for
the next operation is found to be left in the adder during the previous
calculation. This occurs for certain filter structures with some coefficients.

These  optimizations  help  produce  code  with  essentially  the  same  efficiency 
as that done by hand. For a speech recognition filter bank, the number of micro
instructions was reduced from 480 to 364 words. The optimization is performed
on the symbolic code because the cases are easier to identify than when the
code  has  been  assembled  to  binary. 

With the symbolic code optimized, it is converted to binary for use by the
layout  generator.  This  is  a  simple  operation  because  for  each  symbolic  field 
value  there  is  exactly  one  binary  pattern  for  one  or  more  control  lines.  This 
data is either written directly to a file for the layout generator, or has tester
control data included for use by a real time tester.
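
Because each symbolic field value maps to exactly one bit pattern, the assembly step reduces to table look-up. The Python sketch below shows the idea; the field names, widths and encodings in `FIELDS` are made up for the example and do not describe the actual control word.

```python
# (field name, width in bits, symbolic value -> bit pattern); all hypothetical.
FIELDS = [
    ("mem_op", 2, {"nop": 0b00, "read": 0b01, "write": 0b10}),
    ("shift_mux", 1, {"ram": 0, "acc": 1}),
    ("io", 1, {"none": 0, "out": 1}),
]


def assemble(word):
    """Pack one symbolic micro code word into an integer, first field in
    the most significant bits."""
    bits = 0
    for name, width, table in FIELDS:
        value = table[word[name]]      # one pattern per symbolic value
        if value >= (1 << width):
            raise ValueError(f"pattern for {name} does not fit in {width} bits")
        bits = (bits << width) | value
    return bits


binary = assemble({"mem_op": "write", "shift_mux": "acc", "io": "out"})
```

An unknown symbolic value raises a `KeyError`, which is one simple way such an assembler could report a filter library error.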

VI  THE  TESTER 

The tester set-up shown in figure 9 is quite valuable in producing designs
which work the first time fabricated. The filter compiler running on the VAX 11/780
generates tester code that is down loaded to a pattern generator. The
pattern generator performs exactly the same function as the controller block
included in the complete chip. It sends the horizontal control words in real time
to a data path that is the same as that used in a final chip. The spectrum
analyzer generates digital input data and examines the filter outputs. In this
way, the filter designer can check the input file for errors and the effects of
finite data word width and coefficient truncation on the filter responses. If there
is  a  problem,  it  is  found  before  fabrication.  Further,  this  set  up  will  verify  that 
the  compiler  is  working  properly  and  that  the  filter  library  data  is  correct. 

VII FABRICATED CIRCUITS

Several  circuits  have  been  fabricated  using  this  system.  A  single  band  pass 
filter chip was fabricated to determine the efficiency of a small chip. A 16 channel
filter bank for the front end of a speech recognition system and a 16 channel
consumer  stereo  spectrum  analyzer  have  been  generated  and  fabricated.  Table 
2  gives  a  summary  of  the  performance  of  these  chips. 

All  circuits  have  been  fabricated  with  a  four  micron  NMOS  depletion  load 
process  and  are  designed  to  work  with  a  single  5  V  supply.  Although  a  3  MHz 
non-overlapping clock is sufficient for the chips to operate at the designed
sample rates, they can be run reliably with clocks up to 4 MHz.

VII-1 A Small Chip

The 4 pole band pass filter chip die photo is shown in figure 10a and the
measured frequency response in figure 11a. Although the area per pole for this chip
is  quite  high  it  might  be  useful  when  data  is  in  digital  form  so  that  a  switched 
capacitor  or  other  analog  filter  would  not  be  appropriate  because  of  the  high 
overhead in including the A/D and D/A. The circuit shown has a word width of 10
bits  and  a  dynamic  range  of  48  dB.  Since  each  additional  bit  increases  the 
dynamic range by about 6 dB, a chip with 100 dB of dynamic range would only be 30 %
larger. 

VII-2 A Spectrum Analyzer for a Speech Recognition System

A block diagram of the 112 pole speech recognition system [5] chip is shown in
figure 6. Each channel consists of a 4 pole Butterworth band pass filter, followed
by a full wave rectifier and the first pole of a 3 pole Butterworth low pass filter.
The output of the 1 pole anti-aliasing filter is decimated and low pass filtered
with the rest of the Butterworth filter. A photo of the die is shown in figure 10b.
The frequency response of all 16 channels is shown in figure 11b.

The  number  of  cycles  available  to  perform  all  filtering  is  given  by: 

    number of micro instr = (number of processors) * (max clock rate) / (sample rate)

To ensure that this maximum number of operations was not exceeded, several
steps were taken. Filter structures were carefully chosen. The state variable
form shown in figure 7a has a relatively complex structure compared to the
direct form (figure 7b). However, the state form is less sensitive to coefficient
truncation than the direct form when there is a large ratio of sample frequency
to filter band edge frequency. For low frequency filters, the insensitivity to
coefficient truncation makes the state form filter more efficient than the direct
form. At high frequencies, the direct form becomes more efficient due to its
simpler structure. Therefore, the five lowest frequency filters are state form
while the upper eleven are direct form. To save more cycles, the zeros of the
direct form were factored out of each channel and computed only once. In the
state form, at low frequency the zero at 1/2 the sample frequency has little
effect and was left out.
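
The cycle budget given by the formula above can be worked through in Python. The 3 MHz clock, the 14 KHz sample rate and the 364 word program are figures quoted elsewhere in this report; the use of two processors for the speech recognition chip is an assumption based on the 16 channel entries of table 1, and the function name is hypothetical.

```python
def cycles_per_sample(n_processors, clock_hz, sample_rate_hz):
    """Whole micro instructions available in one sample period."""
    return (n_processors * clock_hz) // sample_rate_hz


# Assumed: 2 processors; quoted: 3 MHz clock, 14 KHz sample rate.
budget = cycles_per_sample(n_processors=2, clock_hz=3_000_000,
                           sample_rate_hz=14_000)

program_words = 364                    # optimized speech bank program
fits = program_words <= budget
spare = budget - program_words
```

Under these assumptions the budget is 428 micro instructions per sample, so the 364 word program fits while the unoptimized 480 word version would not.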

VII-3 A Spectrum Analyzer for Consumer Stereo

The  structure  of  the  16  channel  consumer  stereo  spectrum  analyzer  is  very 
similar  to  that  of  the  speech  recognition  chip.  The  sample  rate  was  increased  to 
20  KHz  to  allow  higher  frequency  filters  and  the  band  pass  filters  were  limited  to 
2 poles. The 1/2 octave filters range in center frequency from 45 Hz to 8 KHz.
The ratio of the lowest frequencies of interest to the sample rate is extremely
small (much worse than the speech recognition chip) indicating that the state
form will be better at lower frequencies. The photo of the die is shown in figure
10c with the log-log frequency response shown in figure 11c.
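
The quoted frequency range is consistent with 16 filters spaced half an octave apart, as the short Python check below shows. It assumes exact geometric spacing starting at 45 Hz, which the text does not state explicitly, and the function name is hypothetical.

```python
def half_octave_centers(f_low_hz, n_filters):
    """Center frequencies spaced half an octave apart (ratio sqrt(2))."""
    return [f_low_hz * 2 ** (k / 2) for k in range(n_filters)]


centers = half_octave_centers(45.0, 16)
top = centers[-1]                      # 45 * 2**7.5, a little above 8 KHz

# Ratio of the lowest center frequency to the 20 KHz sample rate.
lowest_ratio = centers[0] / 20_000.0
```

The lowest center frequency is only 0.225 % of the sample rate, which illustrates why coefficient sensitivity favors the state form for the low channels.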

The  design  of  this  chip  was  automated  one  more  level  than  the  others. 
Instead  of  specifying  the  digital  filters,  a  program  was  written  to  generate  the 
digital  filter  specifications  from  desired  3  dB  frequencies.  The  program  picked 
the  most  suitable  structure  and  determined  and  truncated  all  coefficients. 

VIII CONCLUSIONS

The  tools  discussed  here  have  been  extremely  valuable  in  the  development 
of the circuits that have been fabricated. These tools not only shortened the
hardware design time, but provided testing that found all errors before
fabrication. Minor changes, such as increasing the width of the data path and fine
tuning the gains of the channels, were made by simply editing the filter description
file. Normally this would be a tedious task prone to careless mistakes. By
careful design of the circuit cells and restricting the applications to filter banks, the
software complexity was reduced, giving a development time of one man-month.


Table 1  Parameters for the Circuits of Figure 5.

                         circuit 5a   circuit 5b   circuit 5c   circuit 5d
                         1 channel    8 channel    16 channel   16 channel
RAM length (words)       8            64           64           64
data path word width     10           16           16           20
ROM length               32           128          128          192
Decimation ratio         -            8            8            8
number of processors     1            2            2            2

Table 2  Performance Summary for three Circuits.

                         single          16 channel           16 channel
                         4 pole          speech recognition   consumer hi-fi
data path word width     10              20                   20
size                     2.6mm x 2.5mm   7.2mm x 3.7mm        6.7mm x 3.6mm
power dissipation        280 mW          670 mW               570 mW
S/N                      46 dB           80 dB                80 dB
number of poles          4               112                  60
sample rate              84 KHz (max)    14 KHz               20 KHz

Figure 7. Filter Structures used in the Speech Recognition Chip

Figure 9. Test System for Verifying Algorithms

Figure 10a. 4 pole single BPF chip Die Photo

Figure 10c. Stereo Spectrum Analyzer chip Die Photo

A digital MOS-LSI circuit which implements a full-duplex speech
analysis/synthesis system will be reported. This vocoder I.C. analyzes speech in
real time, generating a low-bit-rate digital data stream suitable for transmission
or storage. Simultaneously, synthesized speech can be generated from an
incoming data stream.

Vocoders transmit two types of information: spectral parameters and
excitation parameters. The vocoder I.C. uses linear predictive coding (L.P.C.) to
represent  the  spectrum.  The  excitation  is  represented  by  its  energy,  a 
voiced/unvoiced  decision,  and  the  period  of  the  pitch  fundamental. 

The vocoder I.C. contains three processors, each dedicated to a particular
part of the algorithm (Fig. 1). Most resources are devoted to the spectral
analysis and pitch tracker. The synthesizer requires relatively little processing,
using about half the resources of Processor No. 1.

Each processor is configured with a word length and data memory size
appropriate to its function. The use of parallel processors, pipelining, and
bit-serial communications all contribute to area efficiency.


Many methods exist for performing an L.P.C. analysis. Most group the input
data into blocks (e.g. 129 samples) from which parameters are extracted. The
alternative is the adaptive approach which updates the parameters every
sample. Adaptive methods are more economical since there is no need to buffer a
block of input. Also, the frame rate used for transmitting the parameters need
not be tied to the block size in an adaptive approach.

In the vocoder I.C. the L.P.C. analysis is performed by a ten-stage adaptive
lattice analyzer [1,2]. This requires both an all-zero filter (Fig. 2) and a
correlator (Fig. 3). The reflection coefficients k_i are formed by taking the normalized
cross-correlation of the signals f_{i-1} and g_{i-1}. This minimizes the energy of the
outputs of each stage. In the frequency domain, the filter response closely
matches the inverse of the input spectrum.
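
A floating point sketch of an adaptive lattice analyzer is given below to show how an all-zero lattice filter and a correlator fit together. It is written in Python under several assumptions that are not taken from the chip: the sign convention of the lattice, the Burg-type normalization of the cross-correlation, the exponential forgetting factor and all names are illustrative choices, and the real circuit works in fixed point.

```python
def adaptive_lattice(x, order=10, forget=0.99):
    """Run an adaptive lattice over the samples x.

    Returns the final reflection coefficients and the residual, i.e. the
    forward output of the last stage. Coefficients are updated every sample.
    """
    num = [0.0] * order          # running cross-correlation per stage
    den = [1e-9] * order         # running energy per stage (avoids 0/0)
    b_delayed = [0.0] * order    # backward signal of each stage, one sample old
    k = [0.0] * order
    residual = []
    for sample in x:
        f = b = sample           # stage 0 forward and backward signals
        for i in range(order):
            bd = b_delayed[i]
            # Correlator: normalized cross-correlation of the stage inputs.
            num[i] = forget * num[i] + 2.0 * f * bd
            den[i] = forget * den[i] + f * f + bd * bd
            k[i] = num[i] / den[i]
            # All-zero lattice stage.
            f_next = f - k[i] * bd
            b_next = bd - k[i] * f
            b_delayed[i] = b     # becomes the delayed input next sample
            f, b = f_next, b_next
        residual.append(f)
    return k, residual
```

With this normalization every coefficient stays within the range -1 to +1, and for a correlated input the residual energy is lower than the input energy, in line with the behavior described above.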

The operation of the lattice analyzer is illustrated in Fig. 4. The data was
obtained by observing the data bus of Processor No. 1. As the signal passes
through the ten stages of the lattice, its energy decreases and its spectrum
flattens. The output of the final stage (the residual) is essentially white. All
useful spectral information is represented by the ten reflection coefficients.


Although not used for low-bit-rate transmission, the residual permits use of the
vocoder I.C. in higher bit-rate systems which encode the residual.

Among  the  advantages  of  the  lattice  approach  is  the  low  coefficient  sensi¬ 
tivity.  Short  (S-bit)  word  lengths  may  be  used  for  the  reflection  coefficients, 
reducing  the  hardware  required  for  multiplication. 

The pitch tracker uses Gold's algorithm [3]. Peaks and valleys in the
low-pass filtered speech are detected. The current input sample and the levels of
the most recent peak and valley are combined in different ways to form six
signals. These six signals are fed to six identical pitch detectors.

Each pitch detector attempts to time the interval between peaks in its input.
Detection of a peak is followed by a blanking interval during which new peaks are
ignored. Smaller peaks whose amplitudes fail to exceed an exponentially
decaying threshold are also ignored. The threshold level is derived from the
amplitude of the last peak detected. Counting the number of samples between peaks
gives an estimate of the pitch period. The operation of one of the six pitch
detectors is shown in Fig. 5, obtained from Processor No. 3. The six pitch
estimates thus obtained are combined with a "scoring" algorithm. A
voiced/unvoiced decision is also made.
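
The behavior of a single pitch detector can be sketched in Python as follows. This is an illustration of the blanking interval and decaying threshold only, not the microcode of the chip: the blanking length, the decay constant, the rule that the threshold starts to decay when blanking ends, and the function name are all assumptions.

```python
def detect_pitch_periods(signal, blanking=20, decay=0.97):
    """Return the sample counts between accepted peaks of `signal`."""
    periods = []
    last_peak = None     # index of the last accepted peak
    last_amp = 0.0       # its amplitude, from which the threshold is derived
    for n in range(1, len(signal) - 1):
        x = signal[n]
        is_peak = x > signal[n - 1] and x >= signal[n + 1] and x > 0
        if not is_peak:
            continue
        if last_peak is not None:
            elapsed = n - last_peak
            if elapsed < blanking:
                continue                     # inside the blanking interval
            threshold = last_amp * decay ** (elapsed - blanking)
            if x <= threshold:
                continue                     # too small, ignored
            periods.append(elapsed)          # pitch period estimate
        last_peak, last_amp = n, x
    return periods
```

For a test signal with a large peak every 50 samples and smaller peaks 10 and 30 samples after each one, the first small peak falls in the blanking interval, the second fails the threshold, and every estimate is 50 samples.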

The complete vocoder algorithm is encoded in 6 kbit of microcode. A
sequencer containing the microcode ROM controls the three processors. Certain
operations which cannot be efficiently microcoded are implemented by
specially designed circuits. An example is the squaring operator used in the
correlator (Fig. 3), which is done by table look-up. The following circuit sections are
identified by number in Fig. 6:

(1)  Excitation  source  for  synthesizer 

(2)  Address indexing unit for Processor No. 1

(3)  FIFO  buffer  for  reflection  coefficients 

(4)  Squaring  circuit 

(5)  Address indexing unit for Processor No. 3

(6)  Decision-making  logic  for  pitch  tracker 

The circuit is fabricated in a 4 µm NMOS process, containing 23,000 transistors
on  a  .265"  by  .225"  die.  The  circuit  requires  a  2.39  MHz  clock  and  dissipates  600 
mW. 

† Research supported in part by Defense Advanced Research Projects Agency,
Contract No. MDA903-79-C-0429.

*  Currently  with  the  Norwegian  Defence  Research  Establishment. 

Fig. 1 Signal flow chart for single-chip vocoder

Fig. 2 Flow chart for the analysis lattice filter

Fig. 3 Flow chart for the correlator

ABSTRACT

The root LPC system has closed form analysis and a formant like
synthesis structure. By using quadratic coefficient quantization and
section repeat, its data rate can be lower than 1 Kbps. By including a
representative residual signal of variable repetition rate, its quality can
be continuously improved. A special purpose NMOS-LSI chip was
built to implement the synthesis function.


INTRODUCTION 

The root LPC is an extension of LPC and has a formant like
structure. It has the closed form analysis as in LPC and can reach a
very low data rate as in formant synthesis. The prediction residual
signal is included in synthesis for better speech quality as in the
RELP (Residual Excited Linear Prediction) system [1]. By properly
processing the synthesis filter parameters and the residual signal,
variable data rates with a wide range of speech quality can be obtained.

In LPC systems the reflection coefficients have been named the
'best' set for coding in terms of filter stability upon quantization and
inherent ordering of the coefficients [2].

Filter Stability

The roots of A(z), the prediction polynomial, can also guarantee
the stability of the synthesis filter 1/A(z) by always residing inside the
unit circle in the Z-plane.

Parameter Inherent Ordering

The inherent ordering of parameters is good for parameterizing the
signal since the parameter positions are also used for carrying
information. But a good analyzer should be able to decompose the speech
into basic features which are represented by unrelated parameters or
groups of parameters. Then it would be straightforward to emphasize
or de-emphasize any feature without side effects on others in
reconstructing the speech. From this point of view the inherent ordering of
all parameters is not really a good thing.

The roots of A(z), like the formants, do not possess an inherent
order. If according to some characteristic measurement the root pairs
were put in order, the advantage of inherently ordered parameters
could be regained.


ROOT  LPC  SPEECH  SYNTHESIS 

The first step in moving from straight LPC to root LPC is to
find the roots of the LPC polynomial

$$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_M z^{-M} = \prod_{i=1}^{M/2} \left(1 + b_i z^{-1} + c_i z^{-2}\right)$$

where the $a_i$'s, $1 \le i \le M$, the prediction coefficients, are all real.
Bairstow's method is the most popular numerical method for finding
the roots of a polynomial with real coefficients without using complex
arithmetic [3].
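
As a rough illustration of the factored form, the sketch below multiplies
quadratic factors back into the prediction coefficients and checks each
factor against the unit circle. The function names and the (b, c) pair
layout for a factor 1 + b z^-1 + c z^-2 are assumptions made for this
example; it does not implement Bairstow's method itself.

```python
def expand_sections(sections):
    """Multiply factors (1 + b z^-1 + c z^-2) into the coefficients of A(z).

    sections: list of (b, c) pairs (hypothetical layout for this sketch).
    Returns [1, a_1, ..., a_M] with M = 2 * len(sections).
    """
    coeffs = [1.0]
    for b, c in sections:
        factor = [1.0, b, c]
        product = [0.0] * (len(coeffs) + 2)
        for i, a in enumerate(coeffs):
            for j, f in enumerate(factor):
                product[i + j] += a * f   # polynomial multiplication
        coeffs = product
    return coeffs


def is_stable(b, c):
    """True if both roots of z^2 + b z + c lie strictly inside the unit circle."""
    return abs(c) < 1.0 and abs(b) < 1.0 + c


if __name__ == "__main__":
    sections = [(-1.0, 0.5), (0.5, 0.25)]
    print(expand_sections(sections))
    print(all(is_stable(b, c) for b, c in sections))
```

Because each factor can be tested on its own, stability is easy to verify
after the quadratic coefficients have been quantized.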

Filter Memory Effect at Frame Boundaries

In the root LPC synthesizer, if the quadratic factors were not
well ordered, some sections could have a large coefficient change from
frame to frame. In this case the waveform discontinuities at frame
boundaries may be serious [4]. An example of this memory effect is
shown in Fig. 1.

In  Fig  1  there  are  two  and  a  half  frames  of  synthesized  speech. 
Each  frame  contains  20G  samples,  which  are  equal  to  2  pitch  periods 
of  speech  The  filter  coefficients  for  them  are  all  the  same  except  fac¬ 
tors  #1  and  #3  are  interchanged  in  the  frame  of  samples  400-7)9 

To minimize this memory effect, the filter input should be
relatively larger than its memory. In a cascade form filter the input to a
quadratic section is the output of the preceding section. Therefore
the preceding section is expected to have a relatively lower decay rate.

The ordering scheme based on the decay rate only reduces the
memory effect; sometimes it is still audible. Experiments
showed that 'zeroing the memory' at the frame boundaries did not
introduce any audible discontinuities. Therefore the 'memory zeroing'
method was adopted to fix the memory effect problem, and the
factor ordering is used exclusively for the quantization and
repeat algorithm, which will be discussed next.
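
The memory-zeroing idea can be sketched as follows. This is a minimal
behavioral model, assuming each quadratic section computes
y[n] = x[n] - b*y[n-1] - c*y[n-2]; the function name, the frame layout
and the `zero_memory` flag are hypothetical and are not taken from the
chip design.

```python
def synthesize(frames, zero_memory=True):
    """Run a cascade of all-pole quadratic sections over successive frames.

    frames: list of (sections, excitation) pairs, where sections is a list
            of (b, c) coefficients and excitation is a list of samples.
    zero_memory: if True, clear every section's state at each frame boundary.
    """
    out = []
    state = None
    for sections, excitation in frames:
        if state is None or zero_memory or len(state) != len(sections):
            state = [[0.0, 0.0] for _ in sections]   # y[n-1], y[n-2] per section
        for x in excitation:
            for k, (b, c) in enumerate(sections):
                y1, y2 = state[k]
                y = x - b * y1 - c * y2
                state[k] = [y, y1]
                x = y                                  # feeds the next section
            out.append(x)
    return out


if __name__ == "__main__":
    section = [(-0.5, 0.0)]
    frames = [(section, [1.0, 0.0]), (section, [0.0, 0.0])]
    print(synthesize(frames, zero_memory=True))
    print(synthesize(frames, zero_memory=False))
```

With zeroing enabled, the second frame starts from rest, so a coefficient
change at the boundary cannot interact with stale state from the previous
frame.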


Figure  1:  Memory  Effect  for  Improper  Factor  Ordering 


DATA  COMPRESSION 

Parameter quantization and section repeat are used for data
compression. Both of them depend heavily upon the ordering of the
quadratic factors. A good factor ordering scheme makes the section
functional variation across frame boundaries minimal so as to reduce
the memory effect and benefit parameter quantization and repeat.

Pole Concentration Regions

Suppose the pole concentration regions for the quadratic sections
are already located. The factors are ordered by maximizing the
number of matches between the quadratic factors and the concentration
regions [6]. A match means the roots of the factor are within the
assigned region. The factor assigned to region #1 is put in section 1
of the cascade filter, and so on.

The concentration regions can be derived by a training process,
which may be replaced by a fixed assignment for simplicity. The
region assignments for LPC-12 are
(1) { 0.6<r<1, 0<θ<0.8 },
(2) { 0.6<r<1, 0.5<θ<1.6 },
(3) { 0.6<r<1, 0.8<θ<2.0 },
(4) { 0.6<r<1, 1.5<θ<2.4 },
(5) { 0.6<r<1, 1.8<θ<3.14 }, and the real axis,
(6) { 0.6<r<1, 2.#<θ<3.14 }, { 0<r<0.8, 0<θ<3.14 }, and the real axis,
where r is the radius and θ is the angle from the real
axis in the Z-plane. Neighboring regions overlap.
Only the regions for voiced frames are listed. The unvoiced
frame using LPC-4 has only two factors; they do not need to be
ordered by matching. Also, only the regions above the real axis are
specified here because all pole locations are symmetrical about the
real axis.
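
The matching step can be sketched as a small assignment search. The code
below is only an illustration: it assumes a factor is stored as a (b, c)
pair for 1 + b z^-1 + c z^-2, that a region is a pair of (r_lo, r_hi) and
(theta_lo, theta_hi) bounds, and that trying every permutation is
acceptable for six sections. The real-axis clauses of the region list are
not modeled.

```python
import cmath
from itertools import permutations


def pole(b, c):
    """Radius and angle (0..pi) of the upper-half-plane root of z^2 + b z + c."""
    z = (-b + cmath.sqrt(b * b - 4.0 * c)) / 2.0
    return abs(z), abs(cmath.phase(z))


def in_region(factor, region):
    """True if the factor's pole lies inside the (radius, angle) box."""
    r, theta = pole(*factor)
    (r_lo, r_hi), (t_lo, t_hi) = region
    return r_lo < r < r_hi and t_lo < theta < t_hi


def order_factors(factors, regions):
    """Return the ordering of factors with the most region matches."""
    best, best_matches = None, -1
    for perm in permutations(factors):
        matches = sum(in_region(f, reg) for f, reg in zip(perm, regions))
        if matches > best_matches:
            best, best_matches = list(perm), matches
    return best, best_matches
```

For example, with the bounds of regions (1) and (3) above, a factor whose
pole sits at angle 0.3 is placed in section 1 and one at angle 1.2 in
section 2, regardless of the order in which the root finder produced them.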

In an experiment on 30.81 seconds of male and female speech,
fewer than 3% of the sections were unmatched for these region
assignments.

Quadratic Coefficient Quantization

A short code word is used for matched sections and extra bits
are used for the unmatched ones. Suppose there are $N_i$ bits allocated
to the short code word for the i-th section. One code is reserved for
the unmatch flag. Then $2^{N_i} - 1$ code points are distributed in the i-th
section region. For unmatched sections, a long code word is indicated
by all 1s in these $N_i$ bits, and the following $M_i$ more bits are used for
coding the unmatched coefficients. Since unmatched sections occur
only with low probability, the average number of bits for the i-th
section is only slightly more than $N_i$.
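
A minimal sketch of this escape-code scheme is given below. The function
names, the bit-string representation and the example bit counts are
assumptions for illustration; the actual bit allocations per section are
not reproduced here.

```python
def encode_section(index, n_bits, long_code=None, m_bits=0):
    """Encode one section as a bit string.

    index: code point inside the section's region, or None if unmatched.
    n_bits: length of the short code word (N_i in the text).
    long_code, m_bits: value and length of the extra code (M_i bits)
                       used only for unmatched sections.
    """
    escape = (1 << n_bits) - 1            # all 1s is reserved as the unmatch flag
    if index is not None:
        if not 0 <= index < escape:
            raise ValueError("index must be one of the 2**n_bits - 1 code points")
        return format(index, "0{}b".format(n_bits))
    return format(escape, "0{}b".format(n_bits)) + format(long_code, "0{}b".format(m_bits))


def average_bits(n_bits, m_bits, p_unmatched):
    """Expected bits per section when a fraction p_unmatched needs the long code."""
    return n_bits + p_unmatched * m_bits


if __name__ == "__main__":
    print(encode_section(5, 4))
    print(encode_section(None, 4, long_code=3, m_bits=6))
    print(average_bits(4, 6, 0.03))
```

With an unmatched probability of a few percent, the expected length stays
close to the short code word length, which is the point made above.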


The numbers of bits used are listed below, where i = 7, 8 are for
the unvoiced frames [6].


Quadratic Section Repeat

In root LPC, parameter repeat can be applied to individual
sections instead of the whole parameter set. Suppose there are N frames
and N sets of quadratic coefficients available for some specific section.
The quadratic coefficient set $Q_i$ is derived by analyzing the speech
waveform in frame #i. In Fig. 2 the horizontal axis is the frame
number and the vertical axis is for the $Q_i$'s. The main diagonal from
the bottom left to the upper right corresponds to the case without
repeat. An example of some forward and backward repeats is plotted
in Fig. 2, where one coefficient set is shared by frames #1 and #2,
another by frames #4 and #5, and a third by frames #7, #8 and #9.
Thus only 7 sets of parameters are needed for 11 frames in this case.


Figure  2:  Coefficient  assignment  with  section  repeat 


The goal of section repeat is to use the minimum number of Q's for
all frames under some quality deviation allowance. The quality
deviations for all points on the graph can be called the local errors. It is
assumed that $Q_i$ is the best set for the i-th frame and is without quality
deviation. Then for all points on the main diagonal the local errors
are zero. For off-diagonal points (i,j), i ≠ j, which means $Q_j$ replaces
$Q_i$ for the i-th frame, some quality deviation is introduced. All points
on a column with local errors less than a threshold indicate that the
associated $Q_j$'s could be used in that frame under the allowable quality
deviation.

The number of extra bits needed for updating the coefficients is called the
transition cost. The transition cost is zero if no updating is needed,
which corresponds to staying on the same row for the next column.
To synthesize the whole speech with the minimum number of Q's means finding a
path from the left to the right on the graph with the minimum number of
rows. The dynamic programming technique can be used to search for
the optimal path [7].
Dynamic Programming

For each column on the frame-parameter plot, calculate the
local errors for all points. If the local error is greater than the
threshold, any path passing through this point is inhibited; otherwise the
point is available to all candidate paths.

For each path, its cost and error are defined as

$$\text{path cost} = \sum_{\text{path}} (\text{transition cost})$$

$$\text{path error} = \sum_{\text{path}} (\text{local error})$$

There could be many paths from the preceding column to each
available point. Only the path with minimum cost needs to be saved. If
more than one path has the same cost, the one with minimum path
error is saved.

This procedure is repeated for all columns from the left to the
right, and the path terminating on the last column with minimum cost
or error is taken. Then by back tracking from the right to the left
the best scheme for section repeat is obtained.
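
The search described above can be sketched as follows. This is an
illustrative version only: it assumes a unit transition cost whenever the
row changes (and for the first frame), a square table
`local_error[frame][row]` with zeros on the diagonal, and a single
threshold; the name `section_repeat` is hypothetical.

```python
def section_repeat(local_error, threshold):
    """Choose which coefficient set (row) each frame (column) should use.

    local_error[i][j]: quality deviation when set j is used in frame i.
    Returns a list `path` with path[i] = index of the set used in frame i.
    """
    n = len(local_error)
    # Each table maps row -> (path cost, path error, previous row).
    history = [{j: (1, local_error[0][j], None)
                for j in range(n) if local_error[0][j] <= threshold}]
    for i in range(1, n):
        current = {}
        for j in range(n):
            e = local_error[i][j]
            if e > threshold:
                continue                      # point is inhibited
            best = None
            for k, (cost, err, _) in history[-1].items():
                cand = (cost + (0 if k == j else 1), err + e, k)
                # keep minimum cost, then minimum error on ties
                if best is None or cand[:2] < best[:2]:
                    best = cand
            if best is not None:
                current[j] = best
        history.append(current)
    # Pick the best end point and back track from right to left.
    j = min(history[-1], key=lambda row: history[-1][row][:2])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = history[i][j][2]
        path.append(j)
    path.reverse()
    return path


if __name__ == "__main__":
    errors = [[0.4 * abs(i - j) for j in range(4)] for i in range(4)]
    print(section_repeat(errors, threshold=0.5))
```

Raising the threshold lets more frames share a row, which lowers the
number of coefficient updates at the price of a larger path error.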

Different thresholds can be set for the sections. Lower
thresholds were set for the formant sections and higher thresholds for the
non-formant sections.


SPEECH  QUALITY  ENHANCEMENT 

While  the  unvoiced  sounds  are  important  for  syllable 
identification,  the  vowel  sounds  dominate  the  speech  quality.  To 
enhance  the  speech  quality  without  paying  the  full  price  in 
bandwidth,  only  the  residual  signal  for  voiced  sound  is  stored  or 
transmitted. 

Representative Residual for Voiced Frames

The single pulse in the pulse train used as the excitation for
voiced sound could be called the 'anchor pulse'. In addition to the
anchor pulse, the residual signal has many 'residual pulses' for
adjusting the local prediction error against the original waveform. There are
two ways of using part of the residual pulses. One is to add the most
effective pulses first [8]. Another way is to add them one right after the
other from the anchor pulse until the next one.

Experiments showed that the residual pulses must be dense and
long enough for speech quality enhancement. For partial quality
enhancement a representative residual of a full pitch period is used for
some continuous frames. The highest quality, with the highest data rate,
needs one residual per frame.

Target Waveform

For each frame, one pitch period of representative residual pulses
is used for quality enhancement. The original speech waveform of one
pitch period which would be regenerated with the residual pulses is
called the 'target waveform'. The rest of the waveform in a frame
is regarded as replicas of the target waveform with minor
deviations. Experiments showed that speech with all voiced frames filled
with their target waveforms sounds close to the original.

The target waveform is located by cross correlating the impulse
response of the synthesis filter and the pre-emphasized original
waveform. The waveform segment having the maximum cross
correlation is taken as the target waveform.

Derivation of Representative Residual

If a representative residual for N frames is desired, there are N
target waveforms to be regenerated with N sets of parameters and
gains but only one residual. Each residual pulse will be used for
adjusting N local prediction errors.

Suppose the j-th sample of the i-th target waveform, $s_{ij}$, is
predicted from the preceding samples as $\hat{s}_{ij}$ and adjusted by the
residual pulse $g_i r_j$, where $g_i$ is the gain factor for frame #i. Then the
error energy $E_j$ is

$$E_j = \sum_{i=1}^{N} \left(s_{ij} - \hat{s}_{ij} - g_i r_j\right)^2$$

To find the minimum of $E_j$ by varying $r_j$, its derivative is set to zero.
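
As a worked step, assume the error energy has the least-squares form
$E_j = \sum_i (s_{ij} - \hat{s}_{ij} - g_i r_j)^2$, where $s_{ij}$ is the j-th
sample of the i-th target waveform, $\hat{s}_{ij}$ its prediction, $g_i$ the
gain of frame #i and $r_j$ the shared residual pulse (these symbol names are
a reading of the notation, not a quotation). Setting the derivative to zero
then gives

```latex
\frac{\partial E_j}{\partial r_j}
  = -2 \sum_{i=1}^{N} g_i \left( s_{ij} - \hat{s}_{ij} - g_i r_j \right) = 0
\quad\Longrightarrow\quad
r_j = \frac{\sum_{i=1}^{N} g_i \left( s_{ij} - \hat{s}_{ij} \right)}
           {\sum_{i=1}^{N} g_i^{2}}
```

For N = 1 this reduces to $r_j = (s_{1j} - \hat{s}_{1j})/g_1$, the ordinary
gain-normalized prediction residual.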


This  procedure  is  repeated  from  the  anchor  pulse  for  all  residual 
pulses. 

The algorithm developed above for the representative residual
pulse is good for any value of N, the number of frames. The larger N
is, the less ability the residual pulse has to adjust local errors.
Therefore when the residual repetition rate, 1/N,
increases, the synthesized speech quality also increases.


Figure  3:  Multirate  Root  LPC  analyzer  System 


Figure  4:  Multirate  Root  LPC  Synthesizer  System 


THE  MULTIRATE  ROOT  LPC  SYSTEM 

The multirate root LPC analyzer and synthesizer are depicted
in Figs. 3 and 4. The multirate capability is accomplished with three
independent processes:

1. quantization and its code books,

2. section repeat and its thresholds, and

3. representative residual and its repetition rate.

A subjective speech quality test and a diagnostic rhyme test for
the syntheses at different data rates were taken by ten people to check
the quality and intelligibility variation [9]. The subjective quality test
consists of five sentences spoken by male and female speakers with a
wide range of pitch period. These sentences are

1.  This  is  a  demonstration  of  synthetic  speech,  (male) 

2. Please fasten your seat belt. (female)

3. Dr. Bob is on vacation again. (male)

4. Did you exercise in the last hour? (female)

5. Synthetic speech may be useful in many consumer applications.
(male)

For each sentence, five different syntheses were made and played
pairwise for quality comparison. These five syntheses, all of which
were automatically performed, are: (1) quantized root LPC with
section repeat, (2) quantized root LPC, (3) root LPC, (4) root LPC
with one representative residual per five voiced frames, and (5) root
LPC with one representative residual per two voiced frames. The
listeners were asked to make one decision from (a) apparently lower,
(b) a little lower, (c) the same, (d) a little higher, and (e) apparently
higher in quality for the second synthesis compared to the first one of
each pair. The scores for the relative quality are -2, -1, 0, +1, and
+2 for these answers, respectively.


The average data rates for the above five syntheses are 964,
1443, 6294, 8577, and 10930 bits per second, respectively, for the above five
sentences. The subjective quality test results are shown in the table and
Fig. 5.

In the plot the vertical scale is the overall relative quality score from
ten people. The plot shows that many people felt the synthesized
speech quality increased with the data rate.


The results of the diagnostic rhyme test for the five data rates
are listed below.

There is only a little intelligibility loss when the data rate is decreased.


CHIP  OPERATION  AND  ARCHITECTURE 

The implemented synthesizer chip is basically a linear time-varying
cascade quadratic all-pole filter. It has programmable filter
coefficients. The excitation received is filtered with the stored
coefficients and then sent out with one clock cycle delay. The filter
coefficient and excitation inputs and the speech sample output all
pass through the same parallel I/O port. Thus a line to differentiate the
input data as being coefficients or excitation is required. Also a
first-in-first-out (FIFO) data buffer is used for asynchronous data input.
This chip is controlled by a master microprocessor, the iAPX186 [10]. The
system block diagram is shown in Fig. 6.


Figure 5: Synthesized Speech Quality versus Data Rate (horizontal axis: data rate, 0 to 10 Kbps)


The synthesizer is triggered to start processing for one sample
by an inverted pulse on the START pin. First it reads in the
excitation, multiplies it with the gain factor, and de-emphasizes and outputs the
speech sample from the last clock cycle. Then the excitation is filtered
with the stored coefficients and memory state values. If UPDATE
is at high voltage, the updating gain factor and 12 coefficients are read
in through the data port to overwrite the stored values. The filter
memory is cleared if UPDATE is high. After all six quadratic
sections have been performed, the synthesizer responds with a STOP pulse
and waits for the next START pulse.

IN and OUT are used to synchronize data input and output
through the 18-bit data port. DAV indicates whether a datum is
available in the FIFO. For NMOS circuits, a 5 V supply is required across
VDD and GND. A nonoverlapped two-phase clock is applied to Ph1
and Ph2.

Internally the chip consists of an 18-bit signal processor
block [11], a 32-word RAM, a program counter, a 24-bit by 90-word
program ROM, RAM address indexing and a random logic circuit.


SUMMARY 

A root LPC speech synthesizer with a variable bit rate has been
built using a special-purpose NMOS LSI circuit. The data rate can be
from less than 1 Kbps to greater than 10 Kbps and any value in
between. The output speech quality is determined by the desired
input data rate.

The root LPC speech synthesis is a combination of formant and
LPC systems. The LPC polynomial is factored into quadratic factors.
These factors are ordered by optimally matching their pole locations
with overlapped regions in the Z-plane. The ordered factors are then
used as the stages of a cascade form formant synthesizer. The
memory effect is fixed by zeroing the memory at frame boundaries.

To lower the bit rate without losing intelligibility, quadratic
coefficient quantization based on a perceptual measurement and a
section repeat algorithm based on dynamic programming are used. To
upgrade the speech quality in terms of naturalness and smoothness,
representative prediction residuals are used as the synthesis filter
excitation for the voiced frames. By adjusting the maximal allowable
spectral distances for section repeat and/or the repetition rate of the
representative residuals, a continuum of bit rates with varying speech
quality can be achieved.

A chip was designed to implement this synthesizer. It uses a
macrocell design approach and requires 16.2 square millimeters of die
area. For a 10 kHz sampling rate it runs with a master clock of 3.4 MHz.


Figure 6: System Block Diagram of the Synthesizer


Abstract

The physics of several hot-electron
currents and their impact on IC performance
and reliability are reviewed. $V_d - V_{dsat}$ is
emphasized as the driving force of all
hot-electron effects. $V_g$ and $L$ affect the
hot-electron effects only through their influence on
$V_{dsat}$. A simple hot-electron scaling rule is to
scale $V_d$, $V_t$, and $\sqrt{x_{ox} x_j}$ in proportion to $L$.
Several proposed structural changes should
provide considerable relief to the hot-electron
problem.


Introduction 

1974 saw the introduction of 5 V 1K static
RAMs having 6 µm channels and 1200 Å gate
oxide. Today's 5 V RAMs often have channel
lengths shorter than 1.5 µm and gate oxides
thinner than 250 Å. The power supply voltage
has escaped scaling for reasons of compatibility
with existing systems and circuit speed and
margins. Voltages are even higher in
bootstrapped circuits and special high voltage
circuits such as EPROMs. Even after relief
comes in the form of lower power supply
voltages, the same reasons cited above plus the
imperfect control of threshold voltages and the
nonscalability of subthreshold I-V characteristics
will continue to peg the power supply
voltage near unacceptable values.

As a result, the hot-electron effects have
caused concerns for and actual occurrences of
circuit failures, performance losses, and
reliability problems. This paper attempts to clarify
the conceptual models of the effects and
suggests some rules and tools that may simplify
the visualization, characterization, and scaling of
the hot-electron effects. It is not an exhaustive
or balanced survey of the literature on this
subject. The discussion will be devoted to
N-channel MOSFETs. The hot-carrier effects are
weak enough in PMOS not to cause immediate
concern.


Hot-Electron Currents and Their Impact
The term "hot-electron effects" often refers
only to the phenomenon of device degradation
due to channel hot-electron injection [1]. A
clearer and more useful picture may be
obtained by defining it as all the hot-electron
currents described below and their impact on
circuit performance and reliability.

Referring to Fig. 1, the substrate current,
$I_{sub}$, results from the hole generation by the
channel hot electrons through impact ionization.
If the impact ionization coefficient is
$A_i \exp(-B_i/E)$, where $E$ is the electric field,
$I_{sub}$ can be shown to be [2]

$$I_{sub} = C_1 I_d\, e^{-B_i/E_m} \approx 2 I_d\, e^{-1.7\times10^{6}/E_m} \qquad (1)$$

$E_m$ in V/cm is the maximum channel electric
field, i.e., the field at the drain end of the channel.
Excessive $I_{sub}$ can overload on-chip
substrate-bias generators. The potential (ohmic
voltage) variations in the substrate produced
by the flow of $I_{sub}$ can cause $V_t$ variations and,
in severe cases, snap-back (avalanche) breakdown
of the MOSFET [3] or latch-up in CMOS circuits [4].

The gate current, $I_g$, has a similar theoretical
dependence on $E_m$ (V/cm) [2],

$$I_g = C_2(E_{ox})\, I_d\, e^{-\phi_b/(\lambda E_m)}, \qquad V_g > V_d \qquad (2)$$

$E_{ox} = (V_g - V_d)/x_{ox}$ is the oxide field near the
drain, where $C_2(E_{ox})$ increases from about $10^{-3}$
at $E_{ox} \approx 0$ to about $4\times10^{-3}$ at $E_{ox} \approx 10^{6}$ V/cm. $\lambda$ is
the channel hot-electron mean free path and is
about 78 Å [5]; $\phi_b$ is the Si/SiO$_2$ barrier height
in volts modified by barrier lowering [6]. $\phi_b$ is
about 3.1 V at $E_{ox} \approx 0$ and about 2.5 V at $E_{ox} = 10^{6}$
V/cm. The exponential term is the probability
for an electron to gain more energy than $q\phi_b$
without suffering a collision. $C_2$ is the probability
for an energetic electron to be injected into
the oxide. Some other combinations of $\lambda$ and
$C_2$ values can also fit the $I_g$ data well. The
injected electrons may be trapped in the oxide,
causing $V_t$ drift, or generate interface traps,
causing degradations in electron mobility and
subthreshold characteristics, and additional $V_t$
shift [7].

Minority carrier (electron) current, $I_n$, is
known to be released into the substrate from a
MOSFET operating in the saturation region.
These electrons were mistakenly believed to be
the product of secondary impact ionization [4],
which is in fact many orders of magnitude too
small to explain $I_n$ [8]. Two mechanisms are
responsible, as indicated in Fig. 1. When $I_{sub}$ is
low, the dominating mechanism is photo-carrier
generation, where the photons are produced
by the hot channel electrons [9].
Bremsstrahlung is believed to be the photon
generation mechanism, and the probability for a
channel electron to generate a photon with
energy $h\nu$ is proportional to $\exp(-h\nu/(q\lambda E_m))$.
The total rate of photon generation translates into
[10]

$$I_n \approx 6\times10^{-5} I_d\, e^{-1.3\,\mathrm{eV}/(q\lambda E_m)} = 6\times10^{-5} I_d\, e^{-1.65\times10^{6}/E_m} \qquad (3)$$

where 1.3 eV may be considered as the average
energy of the photon spectrum. If the substrate
resistivity is high and $I_{sub}$ is large enough,
the source-substrate junction may be
sufficiently forward biased such that $I_n$ is
dominated by the electron injection from the
source [8,11]. This second mechanism is also
depicted in Fig. 1. Once the electrons are
deposited in the substrate, by either mechanism,
they may be collected by nearby nodes as
excess leakage currents. These leakage
currents, $I_{coll}$ in Fig. 1, are known to cause
DRAM refresh-time degradation [8] and can
discharge other charge-storage nodes or even
low-current carrying nodes [4]. A portion of
the photon spectrum has a large penetration
depth in Si. $I_{coll}$, therefore, varies with the
separation between the collecting node and the
culprit MOSFET in accordance with a long
effective decay length of about 800 µm (when
photo-carrier generation is the dominating
mechanism), as shown in Fig. 2.

Hot electrons also indirectly affect the I-V
characteristics. When $I_{sub} R_{sub} \approx V_{sub} + 0.6$ V,
significant current, $I$ in Fig. 1, may be injected
from the source to the drain, in addition to the
gate-controlled surface current, causing the
familiar rise in the I-V curve [11]. When the

condition  M  (multiplication  factor)  >  — - —  is 

also  satisfied,  often  at  a  yet  higher  Vd,  snap- 
back  breakdown  occurs  [3]. 


Electron Temperature versus Field
Assuming an isotropic hot-electron gas at
temperature $T_e$, one expects the gate current
to be proportional to $\exp(-q\phi_b/kT_e)$ [5,12].
By fitting the $I_g$ data to numerically simulated
[12,13] or analytically modeled [5] channel
electric field, these studies have independently
concluded that

$$T_e\ (\mathrm{Kelvin}) \approx 9\times10^{-3}\, E_m\ (\mathrm{V/cm}) \qquad (4)$$

With this relationship the gate current
essentially has the form $I_g = C_3 \exp(-\phi_b/\lambda E_m)$,
similar to Eq. 2.
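
As a quick consistency check, the sketch below equates kT_e/q with the
energy gained over one mean free path, lambda * E_m, using the 78 Å mean
free path quoted earlier. The constant and function names are chosen for
this example, and the identification kT_e = q * lambda * E_m is an
assumption implied by comparing the two exponential forms.

```python
BOLTZMANN_V_PER_K = 8.617e-5   # k/q in volts per kelvin
LAMBDA_CM = 78e-8              # hot-electron mean free path, about 78 angstroms


def electron_temperature(e_m):
    """Effective electron temperature in kelvin for a channel field e_m in V/cm.

    Assumes k*T_e = q*lambda*E_m, so that exp(-q*phi_b/(k*T_e)) equals
    exp(-phi_b/(lambda*E_m)).
    """
    return LAMBDA_CM * e_m / BOLTZMANN_V_PER_K


if __name__ == "__main__":
    print(electron_temperature(1.0))      # coefficient in K per (V/cm)
    print(electron_temperature(1.5e5))    # temperature at 1.5e5 V/cm
```

The coefficient comes out near 9e-3 K per V/cm, in line with the
proportionality in Eq. 4.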


Correlations Among the Hot-Electron Currents
By eliminating $E_m$ from any pair of Eqs. 1,
2, and 3, a simple power-law correlation can be
obtained. For example, substituting Eq. 1 for
$E_m$ in Eqs. 2 and 3 yields

$$I_n \approx 3\times10^{-5}\, I_{sub} \qquad (6)$$

Fig. 3 demonstrates the correlation
between $I_g$ and $I_{sub}$. With this correlation we
may monitor $I_g$ by measuring the much larger
$I_{sub}$. The correlation in Eq. 6 has already been
reported in the literature without explanation
[4].  In  addition,  is  linearly  proportional  to 
4>  or  4  ~  4»a*  [ll]-  It  is  important  to  note 
that these correlations are independent of device
dimensions and bias voltages. Recently the
$I_g$-$I_{sub}$ correlation was shown to apply to even
a 0.14 µm channel device [14].


Simple Model of Channel Field
2D and 3D simulations are the most reliable
means of evaluating the channel electric field
[12,15]. The MINIMOS program [15] is freely
distributed and contains an $I_{sub}$ model. No
simulation program available to the public contains
an $I_g$ model. While most analytical field
models were not accurate enough for use in
hot-electron studies, a recent quasi-analytical
model has succeeded remarkably well [5]. It is
based on a quasi-2D approach [16]. A much
simplified version is [5,3]

where $x_j$ is the junction depth and $x_{ox}$ is the
oxide thickness. The drain saturation voltage is
approximately [17]


$$V_{dsat} = \frac{(V_g - V_t)\, L E_{sat}}{V_g - V_t + L E_{sat}} \qquad (8)$$

where $L$ is the effective channel length in µm,
and $E_{sat}$, the critical field for velocity saturation,
is about $3\times10^{4}$ V/cm for lightly doped
substrates and larger for scaled devices in
heavily doped substrates.
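
A small numerical sketch of the drain saturation voltage may help with the
units. It assumes the form V_dsat = (V_g - V_t) L E_sat / (V_g - V_t + L E_sat),
with L given in µm and E_sat in V/cm; the function name and example values
are illustrative only.

```python
def vdsat(v_g, v_t, l_um, e_sat):
    """Drain saturation voltage in volts.

    v_g, v_t: gate and threshold voltages in volts.
    l_um: effective channel length in micrometers.
    e_sat: critical field for velocity saturation in V/cm.
    """
    le = l_um * 1e-4 * e_sat          # L * E_sat in volts (1 um = 1e-4 cm)
    drive = v_g - v_t
    return drive * le / (drive + le)


if __name__ == "__main__":
    # 1 um channel, 4 V of gate drive, E_sat = 3e4 V/cm
    print(vdsat(5.0, 1.0, 1.0, 3e4))
    # A long channel approaches the classical limit V_g - V_t
    print(vdsat(5.0, 1.0, 100.0, 3e4))
```

For short channels the product L * E_sat becomes comparable to the gate
drive, so V_dsat falls well below V_g - V_t.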

Eqs.  7  and  8  may  not  be  accurate  enough  to 
predict  hot-electron  currents,  but  should  be 
useful  in  predicting  the  trend  and  understand¬ 
ing  the  dependence  of  the  hot-electron  effects 
on  bias  and  dimensions  as  shown  below. 


Dependence  on  Bias  Voltage 

All  bias  voltage  dependence  is  contained  in 
Eqs.  7  and  8.  Specifically,  Vg  and  affect  Em 
only  through  Sing  and  Sudlow  [18]  have 

arrived at an equation similar to Eq. 7. One of
their figures is reproduced in Fig. 4. The initial
rise in the familiar bell-shaped curve is due to
rising $I_d$ (see Eq. 1) and the eventual fall is due
to falling $E_m$ (see Eqs. 7, 8). Peak $I_{sub}$ has been
fitted to $\exp(-a/V_d)$ with $a$ dependent on $L$
[12]. $I_g$ also exhibits a bell-shaped curve [19]
peaking at $V_g \approx V_d$. For $V_g > V_d$, $I_g$ falls with
increasing $V_g$ because of falling $E_m$. For $V_g < V_d$,
$I_g$ decreases with decreasing $V_g$ and is almost
independent of $V_d$. In this case, the channel
may be divided into two parts at the point
where the channel potential is equal to $V_g$.
Between this point and the drain, channel
hot-electron injection is negligible because of the
retarding oxide field. Between this point and the
source the channel field, hence $I_g$, is independent
of $V_d$ and decreases with decreasing $V_g$.
The maximum $I_g$ has been fitted to $\exp(bV_d)$
with $b$ dependent on $L$ [12,19].


Concept of Critical Field
In the theory of pn junction breakdown, a
helpful concept is that junction breakdown
occurs when the peak electric field exceeds a
certain critical value. A similar concept can be
introduced for the hot-electron effects. According
to Eqs. 1, 2, and 3, when $E_m = 1.5\times10^{5}$
V/cm, $I_{g,max}$ is about $10^{-15} I_d$, a common
criterion for acceptable long-term stability [1,12].
$I_{sub} \approx 2\times10^{-5} I_d$ and $I_n \approx 6\times10^{-10} I_d$ are also
quite acceptable. Therefore, we might
remember $1.5\times10^{5}$ V/cm as the critical field,
$E_c$, for hot-electron effects.
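
The orders of magnitude quoted here can be checked with a short script.
The sketch assumes the exponential forms of Eqs. 1 to 3 with C_1 = 2,
B_i = 1.7e6 V/cm, a 78 Å mean free path, C_2 = 1e-3 and a 3.1 V barrier
(the low-oxide-field values); all function names are illustrative, and the
results are meant as rough estimates rather than a device model.

```python
import math

LAMBDA_CM = 78e-8   # hot-electron mean free path, about 78 angstroms


def isub_over_id(e_m, c1=2.0, b_i=1.7e6):
    """Substrate current ratio, assuming I_sub = C1 * I_d * exp(-B_i / E_m)."""
    return c1 * math.exp(-b_i / e_m)


def ig_over_id(e_m, c2=1e-3, phi_b=3.1):
    """Gate current ratio, assuming I_g = C2 * I_d * exp(-phi_b / (lambda * E_m))."""
    return c2 * math.exp(-phi_b / (LAMBDA_CM * e_m))


def in_over_id(e_m):
    """Minority-carrier ratio, assuming I_n = 6e-5 * I_d * exp(-1.3 / (lambda * E_m))."""
    return 6e-5 * math.exp(-1.3 / (LAMBDA_CM * e_m))


if __name__ == "__main__":
    e_c = 1.5e5   # V/cm
    print("I_sub/I_d:", isub_over_id(e_c))
    print("I_g/I_d:  ", ig_over_id(e_c))
    print("I_n/I_d:  ", in_over_id(e_c))
```

Because all three ratios depend exponentially on 1/E_m, a modest reduction
of the peak field lowers every hot-electron current by orders of magnitude.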

Device Size and Voltage Limits and Scaling Rule
Perhaps surprisingly, channel length affects
$E_m$ only indirectly, through $V_{dsat}$ (Eqs. 7 and 8).
In order to ensure that $E_m$ is less than the
critical field, $E_c$, $V_d - V_{dsat}$ must be smaller than
$KE_c$ (see Eq. 7). This leads to

$$V_d < \frac{1}{2}\left\{ KE_c + V_t + \left[ (KE_c - V_t)^2 + 4E_{sat}L\,(KE_c - V_t) \right]^{1/2} \right\} \;\xrightarrow{\;L \to 0\;}\; KE_c \qquad (9)$$

Eq. 9 is plotted (using $E_{sat} = 5\times10^{5}$ V/cm) in
Fig. 5 together with measured voltages at which
$I_g \approx 10^{-15} I_d$ [12]. Reasonable agreement is
obtained. The three $KE_c$ values are in about the
same ratio as the square roots of the oxide
thickness, as predicted by Eq. 7. Using
$x_j = 0.3$ µm, $E_c$ is found to be about $1.5\times10^{5}$
V/cm, in close agreement with the expected
critical field strength.
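
The voltage limit can be explored numerically with the sketch below. It
assumes the worst-case bias V_g = V_d, the saturation-voltage form of
Eq. 8, and the condition V_d - V_dsat < K*E_c; under those assumptions the
limit is the positive root of a quadratic in V_d - V_t. The function names
and the example values of K*E_c, V_t and E_sat are hypothetical.

```python
import math


def vdsat(v_g, v_t, l_um, e_sat):
    """Drain saturation voltage (V); l_um in micrometers, e_sat in V/cm."""
    le = l_um * 1e-4 * e_sat
    drive = v_g - v_t
    return drive * le / (drive + le)


def vd_limit(k_ec, v_t, l_um, e_sat):
    """Largest V_d (V) with V_d - V_dsat <= K*E_c, evaluated at V_g = V_d.

    k_ec: the product K*E_c in volts (treated here as a single input).
    """
    le = l_um * 1e-4 * e_sat
    c = k_ec - v_t
    return 0.5 * (k_ec + v_t + math.sqrt(c * c + 4.0 * le * c))


if __name__ == "__main__":
    for length in (0.5, 1.0, 2.0):
        print(length, vd_limit(3.0, 0.7, length, 3e4))
```

At the returned voltage, V_d - V_dsat equals K*E_c exactly, and the limit
collapses to K*E_c as the channel length goes to zero.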

Once $E_c$ and $L$ are fixed, the only other
design change that can increase the $V_d$ limit is
to increase $\sqrt{x_{ox} x_j}$. This, of course, would
degrade other aspects of device performance. Eq. 9
suggests a simple scaling rule for keeping all the
hot-electron effects in check (keeping $E_m$
below $E_c$): $V_d$, $V_t$, and $\sqrt{x_{ox} x_j}$ should all be scaled
linearly with $L$.


Mechanism of Device Degradations
The link between hot-electron injections
and device degradations is still poorly understood.
What are the microscopic reactions or
the macroscopic models of the degradation
mechanisms? What are the kinetics of
degradation, and what governs them? What is the
process dependence? Definite answers are not available.
Fortunately, for the purpose of technology development, one
may well concentrate on the reduction of $E_m$
because the electron-trapping (or interface-traps-generation)
efficiency usually does not
vary as much as the hot-electron population,
which can be greatly suppressed by reducing
$E_m$. It is for this reason that a critical field can
be chosen.

Improved Structures

Many structures have been proposed
[12-20] for expanding the device-size/voltage
limits. There are two approaches. The buried-channel
structure mainly moves the source of
hot electrons farther away from the oxide. The
lightly-doped drain (offset-gate) structure, As-P
double implants, and phosphorus junctions
chiefly reduce $E_m$ in Eq. 7 by dropping some
voltage in a buffer zone. For example, the
lightly-doped drain structure shown in Fig. 6
[20] is optimized when the field is about constant
($E_m$) throughout the length of the N-
region, $L_{N^-}$. Thus, the voltage capability of the
device can potentially be improved by $E_c L_{N^-}$, or
about 15 V per µm of $L_{N^-}$ for $E_c = 1.5\times10^{5}$ V/cm.
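
As a unit check on the product $E_c L_{N^-}$, the sketch below multiplies the critical field quoted earlier in the text ($1.5\times10^{5}$ V/cm) by an N- region length in microns. The function name is hypothetical, and the calculation assumes the field is held exactly at $E_c$ along the whole N- region.

```python
def ldd_voltage_gain(e_c_v_per_cm, l_n_um):
    """Extra drain voltage (V) dropped across an N- region of length
    l_n_um (microns) when the field equals e_c_v_per_cm everywhere in it."""
    l_n_cm = l_n_um * 1e-4  # 1 um = 1e-4 cm
    return e_c_v_per_cm * l_n_cm

gain_per_um = ldd_voltage_gain(1.5e5, 1.0)
print(gain_per_um)  # volts gained per micron of N- region
```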

The degradations of other device characteristics
due to some of these structural
changes, if any, are quite acceptable. These
structures should gain popularity with VLSI circuits
or high voltage circuits. As-P double
implants can be added to an existing process
most easily. The lightly-doped drain structure
provides the greatest flexibility in design
trade-offs and for that reason may be favored
in many cases.

Fig. 3

The device has 82 nm gate oxide. The solid
lines  are  the  theoretical  universal  correla¬ 
tions  between  and  I-,  each  applicable 
for a specific oxide field $E_{ox} = V_{gd}/x_{ox}$.


Fig.  2 

Collected  substrate  electrons  (leakage 
current)  vs.  the  spacing  between  the  col¬ 
lecting  node  and  the  MOSFET.  Two  distinct 
mechanisms  are  evident. 


Fig. 4

The theoretical model is similar to Eq. 7.
After Sing and Sudlow [18].

Fig. 6

Lightly-doped-drain structure and the
channel field profile. After Tsang et al. [20].

ABSTRACT

The phenomenon of, and the physical mechanisms for, the generation
of minority carriers in the substrate of NMOS and CMOS are
studied. Secondary impact-ionization is not responsible. The
responsible mechanisms are hot-electron induced photo-carrier
generation and, under extreme conditions, forward biasing of the
source-substrate junction. The photon generation is believed to be
due to the bremsstrahlung of the channel hot-electrons. A
theoretical model based on the lucky electron concept and
bremsstrahlung mechanism is proposed. The calculated characteristics
of photon generation agree well with experimental results.
About $2\times10^{-5}$ photo-generated minority-carriers are generated for
every (primary) impact-ionization event in n-MOSFET. Photo-

1. Introduction

The presence of, and the circuit performance degradation due to, minority
carriers in MOS IC substrates have been widely recognized [1,2,3]. Excess
minority carriers in the substrate are believed to degrade the holding time in
dynamic RAMs and to induce errors in digital logic circuits. It has been verified
experimentally that the minority carriers in the substrate originate from the
peripheral MOSFETs which are operating in the saturation regime. Two mechanisms
have been proposed for the generation of the minority carriers in the substrate.
They are (i) the secondary-impact-ionization by holes (in NMOS) that
constitute the substrate current [1,2] and (ii) the injection of minority carriers
from the source junction caused by the forward biasing due to the flow of the
substrate current [3].

It has been shown that secondary-impact-ionization is not justified theoretically
[3], while the injection of minority carriers from the source junction into
the substrate is not the only mechanism. A new mechanism which attributes the
generation of minority carriers to photons [4,5] has recently been proposed.
This photo-carrier generation mechanism is fundamentally related to the
presence of hot-electrons at the drain end of the scaled MOSFET. Only when the
substrate current becomes excessively high will the ohmic drop in the substrate
forward bias the source-to-substrate junction sufficiently to make the
injection of minority carriers possible [8].

Using  experimental  results  from  NMOS  and  CMOS  test  structures,  the  two 
mechanisms  of  minority  carrier  generation  (photo-carrier  generation  and 
source  minority-carrier  injection)  are  illustrated.  A  theoretical  model  based  on 
the  lucky  electron  concept  will  be  presented  to  model  the  hot-electron  induced 
photo-carrier generation mechanism. This model will be subsequently compared
to the experimental results. The suppression of the injection of minority
carriers from the source junction will be demonstrated.

2.  Experimental  Techniques 

The presence of minority carriers in the substrate can be detected using
the experimental configuration shown in figure(1) on NMOS wafers. A reverse
biased pn-junction is shown at a fixed separation from a MOSFET biased in the
saturation region. Since this pn-junction is reverse biased, it acts as a collector
of minority carriers in the substrate. In the following, we shall call this junction
the collecting junction. The above structure is placed inside a light-shielded
metal enclosure. The current $I_{coll}$ through the collecting junction is monitored
by an electrometer. The drain-to-source current $I_{DS}$ and the substrate current
$I_{SUB}$ of the MOSFET are also measured.

In figure(2), the experimental configuration used on a p-well CMOS wafer is
illustrated. The collecting junction (p+n-junction) is located outside the p-well
and is a collector of holes (minority carriers) in the n-substrate. The
well-to-substrate junction is reverse biased and hence can collect the minority
carriers (electrons) generated inside the p-well. The reverse bias between the
well and the substrate also inhibits the flow of holes from inside the well to the
n-substrate. An N-channel MOSFET located inside the p-well is biased in the
saturation region and the currents $I_{hcoll}$, $I_N$, $I_S$ and $I_{SUB}$ shown in figure(2) are
simultaneously monitored.

3.  Experimental  Results 

Typical experimental results of the MOSFET-collector pair for the NMOS
wafer are shown in figure(3) and (4). In figure(3), the measured currents are
plotted against the gate-to-source voltage ($V_{GS}$), whereas in figure(4), the
currents are plotted against the drain-to-source voltage ($V_{DS}$). It is apparent
that a correlation between the current collected by the collecting junction and
the substrate current exists. This is demonstrated in figure(5) by the log-log
plot of the data from figure(4). Referring to figure(5), we can identify two
regimes. When the substrate current is not high, the collected minority-carrier
current increases almost (but less than) linearly with the substrate current.

When the substrate current is high, the collected minority carrier current
increases nearly exponentially with the substrate current (figure(5)). This indicates
that when the substrate current is very high, the ohmic drop caused by
the flow of the substrate current can forward bias the source-to-substrate junction
and cause minority carriers to be injected from the source into the substrate.
An important parameter in this situation is the effective substrate resistance
$R_{SUB}$. Theoretical considerations showed that $R_{SUB}$ is a function of channel
width, channel length and substrate doping level [6]. The near-exponential
regime may begin only when $I_{SUB}R_{SUB} > 0.6$ V. An $R_{SUB}$ value of approximately
11 kΩ is obtained from the data in figure(5). By incorporating external resistances
($R_{ext}$) or by applying a slight positive voltage to the substrate, the
injection of minority carriers from the source junction can be enhanced.
Figure(6a) and (6b) illustrate the enhancement. Figure(6a) confirms that injection
of minority carriers from the source junction begins to occur when the substrate
potential reaches about 0.6 V above the source potential. Reverse substrate
bias can suppress this minority carrier injection mechanism as shown in
figure(7). One can conclude that by proper scaling of the substrate doping or
using epitaxial substrate to reduce $R_{SUB}$, and by the use of reverse substrate
bias, the injection of minority carriers from the source into the substrate can be
effectively suppressed.
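
The turn-on condition for source injection can be put into numbers. The sketch below assumes the simplest picture, in which the substrate current flows through $R_{SUB}$ in series with any external resistance and injection starts once the total ohmic drop reaches about 0.6 V. The function name is hypothetical, and the 11 kΩ figure is the approximate value read from the text.

```python
def injection_onset_current(r_sub_ohm, r_ext_ohm=0.0, v_on=0.6):
    """Substrate current (A) at which the ohmic drop across the series
    resistance reaches the junction turn-on voltage v_on (V)."""
    return v_on / (r_sub_ohm + r_ext_ohm)

# Intrinsic substrate resistance only (about 11 kOhm).
i_onset_intrinsic = injection_onset_current(11e3)

# With a 1 MOhm external resistor, as used later to enhance injection.
i_onset_with_rext = injection_onset_current(11e3, r_ext_ohm=1e6)

print(i_onset_intrinsic, i_onset_with_rext)
```

With these inputs, the external resistor lowers the onset current by roughly two orders of magnitude. This is why adding $R_{ext}$ makes the source-injection regime easy to reach experimentally.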

In contrast, the relationship between the collected minority-carrier current
and the substrate current in the low substrate current regime is unaffected by
the addition of $R_{ext}$ or variation in $V_{SUB}$, as shown in figures(6) & (7). This indicates
that the injection of minority carriers from the source is not responsible
for the minority carriers in the bulk when $I_{SUB}$ is moderate or low. It has been
suggested that secondary impact-ionization due to holes (ie. $I_{SUB}$) in NMOS
introduces the minority carriers in the substrate [1,2]. A rather simplified
analysis made by Eitan et al. [3] showed the secondary-impact-ionization to be
10° times too weak to explain the observed $I_{coll}$. We had independently arrived at
the same conclusion following a more rigorous analysis, which found the effect to
be 10*° times too weak.

The correlation between $I_{SUB}$ and $I_{coll}$ is essentially independent of the
bias voltages, channel current, channel length, gate oxide thickness, and all
other device parameters. The independence of this correlation from $V_{GS}$, $V_{DS}$, and
$I_{DS}$ is demonstrated in figure(8). Here, we have included the relationship
between $I_{coll}$ and $I_{SUB}$ at three effective gate voltage levels. The gate voltages
and drain currents in the three cases vary by the ratio 1:2:3. Yet, we see that
the $I_{coll}$ - $I_{SUB}$ relationship is essentially unchanged.

The dependence of the collection of minority carriers on the spatial separation
between the collecting junction and the MOSFET gives some insight into the
transport properties of the minority carriers. In figure(9), two cases of interest
are compared. In one case, an external resistance (1 MΩ) was used to enhance
the injection of minority carriers from the source-to-substrate junction. In
the other case, the MOSFET was operated in the low substrate current regime (ie.
before the turn-on of the source-to-substrate junction). The spatial dependences
of the two cases are quite different. A simple exponential decay is observed for
minority carrier injection from the source, where a decay length of about 31 µm
is obtained. Using the diode-reverse-recovery technique, the minority-carrier
lifetime was determined to be about 0.23 µsec. This decay length is interpreted
as the minority carrier diffusion length. The low $I_{SUB}$ regime curve appears to
have two decay lengths. When the separation is small, the decay length is similar
to the previous case. However, when the separation becomes large, the
effective decay length becomes much longer and a value of 780 µm is obtained.
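
Interpreting the 31 µm decay length as a diffusion length can be sanity-checked with the standard relation $L = \sqrt{D\tau}$. The sketch below, with a hypothetical function name, backs out the diffusivity implied by the two measured numbers. It assumes simple one-dimensional diffusion with a single lifetime.

```python
def implied_diffusivity_cm2_per_s(diff_length_um, lifetime_us):
    """Diffusivity D = L^2 / tau implied by a diffusion length (um)
    and a minority-carrier lifetime (us), returned in cm^2/s."""
    length_cm = diff_length_um * 1e-4   # um -> cm
    lifetime_s = lifetime_us * 1e-6     # us -> s
    return length_cm ** 2 / lifetime_s

d_implied = implied_diffusivity_cm2_per_s(31.0, 0.23)
print(d_implied)  # cm^2/s
```

The result is a few tens of cm²/s, the order of magnitude expected for electrons in lightly doped silicon, so the diffusion-length interpretation is self-consistent. By the same relation, a 780 µm decay length would require a diffusivity several hundred times larger, which is one way to see that diffusion alone cannot account for the long-range tail.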

The long decay length part of the curve can also be fitted to the form — =-e L


with  a  decay  length  of  about  1  mm.  This  difference  in  the  spatial  dependence 
suggests  that  diffusion  of  minority  carriers  cannot  explain  the  transport 
mechanism  in  the  low  substrate  current  regime.  Therefore,  neither  the  injec¬ 
tion  of  minority  carriers  from  the  source  nor  secondary-impact-ionization  can 
explain  the  experimental  results. 

A new mechanism has been proposed in which photons emitted from the
high-field regions near the drain of the MOSFET generate electron-hole pairs in
the substrate [4,5]. Photons with energy more than a few tenths of an eV above the
silicon band-gap are absorbed very near the MOSFET and the minority carriers
generated then spread by diffusion. Photons with energy very close to the
band-gap have a relatively low absorption coefficient and hence can travel a long
distance before they are absorbed. The combination of the above two transport
mechanisms can qualitatively explain our experimental results in figure(9).
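
A minimal two-component picture of figure(9) is sketched below: a short-range term decaying with the 31 µm diffusion length plus a long-range term decaying with the 780 µm effective length. The amplitudes are hypothetical (the report does not give them) and are chosen only to show how a crossover distance arises; the functional form itself is an assumption.

```python
import math

L_NEAR_UM = 31.0    # short-range decay length from the text
L_FAR_UM = 780.0    # long-range effective decay length from the text

def near_term(x_um, amp=1.0):
    """Diffusion-limited contribution (arbitrary units)."""
    return amp * math.exp(-x_um / L_NEAR_UM)

def far_term(x_um, amp=1e-3):
    """Contribution from weakly absorbed near-band-gap photons."""
    return amp * math.exp(-x_um / L_FAR_UM)

def crossover_um(amp_near=1.0, amp_far=1e-3):
    """Separation at which the two contributions are equal."""
    return math.log(amp_near / amp_far) / (1.0 / L_NEAR_UM - 1.0 / L_FAR_UM)

x_c = crossover_um()
print(x_c, near_term(x_c), far_term(x_c))
```

Beyond the crossover the slowly decaying term dominates, which reproduces the two-slope appearance of the low substrate current curve.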

Direct observation of light emission from the drain end of the MOSFET
confirms this proposal [7]. In figure(10a), we have shown a time-exposure photograph
of a MOSFET operating in the saturation regime. A double exposure is
taken to illustrate the background structure of the MOSFET. The IV curves of
the device are illustrated in figure(10b). The device shown in figure(10a) has a
channel width of 100 µm, channel length of 1.1 µm, and oxide thickness of 37 nm.
Under the microscope, a yellowish-white filament of light along the width of the
drain can be observed fairly easily even at lower $V_{DS}$. The intensity of the light
appears to be fairly uniform along the width when $V_{DS}$ is not very high. At higher
$V_{DS}$, spots with higher intensities emerge, which indicates the presence of localized
"hot-spots". The direct observation of light emitted from the drain end of
the MOSFET provides good confirmation of the proposed photocarrier generation
model.

Experimental results from the p-well CMOS wafer are even more convincing
than those from the NMOS wafer. Figure(11) and (12) are typical results plotted
against $V_{DS}$ and $V_{GS}$ respectively. The presence of $I_{hcoll}$ cannot be explained by
either secondary-impact-ionization or source injection, since these mechanisms
cannot give rise to holes outside the well. The holes are generated by photons
as illustrated in figure(2). A correlation between the currents $I_N$ and $I_{hcoll}$
and the substrate current $I_{SUB}$ of the MOSFET is illustrated in figure(13). The
relationship between $I_N$ and $I_{SUB}$ is almost linear, while the relationship between
$I_{hcoll}$ and $I_{SUB}$ is slightly sub-linear. In figure(14), the spatial dependence of
$I_{hcoll}$ is presented. An effective decay length of about 702 µm is observed. This is
in good agreement with the effective decay length obtained from the NMOS wafer.

The current $I_N$ deserves some explanation. Since the p-well to n-substrate
junction is reverse biased, minority carriers inside the well (ie. electrons)
and just outside the well (holes) will be collected by this junction and
give rise to $I_N$. The depth of the p-well in this wafer is about 4 µm. Since most
of the photons will be absorbed not too far away from the p-well junction, $I_N$
can therefore be interpreted as the total minority carriers generated per second
by photons originating from the MOSFET. The presence of $I_{hcoll}$ in the
n-substrate, on the other hand, is due to photons with near band-gap energies. If
we assume that the near band-gap photon component is small compared to the
total number of photons generated and that one photon can only generate one
electron-hole pair, then the ratio $I_N/I_{SUB}$ may be interpreted as the ratio of the
rate of photo-carrier generation to the rate of impact-ionization. (The substrate
current $I_{SUB}$ is a measure of the population of hot-electrons in the MOSFET.)
This ratio is plotted against the normalized substrate current in figure(15). A
ratio of about $2\times10^{-5}$ is obtained, which agrees with the value of $\approx 3\times10^{-5}$
obtained by Matsunaga et al. [2] and $5\times10^{-5}$ obtained by Childs et al. [8], considering
the potential inaccuracies of the measurements. This ratio varies only by
a factor of 4 when the normalized substrate current varies by about five
decades with changing $V_{DS}$ and $V_{GS}$ combinations. From our experimental
results, it appears that the ratio is about $4\times10^{-5}$ at low $I_{SUB}/I_{DS}$ and decreases with
increasing $I_{SUB}/I_{DS}$. Childs et al. [8], however, found this probability to be relatively
constant with $V_{DS}$ and $V_{GS}$. The cause of the difference between our observation
and that of Childs et al. [8] may be that they only measured this ratio for $I_{SUB}/I_{DS}$ ranging
from  ~  10-3  to  10’3. 

4. Physical Mechanism and Model

The phenomenon of light emission from avalanching silicon pn-junctions
(diodes) has been studied extensively [9-14]. This light emission process was
attributed to radiative interband or intraband transitions. The specific processes
considered likely were (1) radiative transition of holes between the light-hole
band and the heavy-hole band [11], (2) radiative recombination between a
hot electron and a free hole [8], and (3) the bremsstrahlung radiation due to the
scattering of the hot electrons by charged Coulombic centers [12,13,14].

The experimental results of Matsunaga et al. [2] showed that the generation
of minority carriers in the substrate is proportional to the substrate current in
both NMOS and PMOS, and they found the ratio of minority carrier generation
rate to $I_{SUB}$ to be $3\times10^{-5}$ and $1\times10^{-4}$ in NMOS and PMOS respectively. The
similarity between NMOS and PMOS quantum efficiencies suggests that the radiative
transitions of holes between the light-hole band and the heavy-hole band are not
the mechanism, since this mechanism would strongly favor PMOS over NMOS due
to the much larger hot-hole population and the much higher hole temperature
in PMOS than in NMOS.

It is more difficult to choose between the remaining two mechanisms. However,
from the experimental results presented in Section(3), we know that $I_{coll}$,
and hence the photon generation rate, is proportional to $I_{SUB}^{\gamma}$, where $\gamma$ is
between 0.7 and 0.9, and is independent of $I_{DS}$ (figure(8)). The rate of recombination
between electrons and holes would be proportional to the product of electron
density ($\sim I_{DS}$) and the hole density ($\sim I_{SUB}$), or to the product of hot-electron
density ($\sim I_{SUB}$, which reflects the creation of the hot-electrons) and the
hole density ($\sim I_{SUB}$), both in disagreement with data. This strongly disfavors
the direct radiative recombination mechanism. Also, the observation of the
polarization of light from avalanche breakdown in pn-junctions [12,14] tends to
support the bremsstrahlung mechanism.
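
The exponent $\gamma$ in a relation of the form $I_{coll} \propto I_{SUB}^{\gamma}$ is the slope of a straight-line fit on a log-log plot. The sketch below shows one way to extract it by least squares. The data are synthetic, generated with an assumed exponent of 0.8 purely to exercise the fit; they are not the measurements of figure(5) or figure(8).

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx

# Synthetic example: I_coll = c * I_SUB**0.8 with an arbitrary prefactor c.
i_sub = [1e-8, 1e-7, 1e-6, 1e-5]
i_coll = [2e-5 * i ** 0.8 for i in i_sub]

gamma = loglog_slope(i_sub, i_coll)
print(gamma)
```

A recombination mechanism would instead predict a dependence on the product $I_{DS}I_{SUB}$ or on $I_{SUB}^2$, that is, a fitted slope near 2 at fixed $I_{DS}$ in the second case, which is what the measured sub-linear slope rules out.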

Using the bremsstrahlung radiation as the principal photon-generation
mechanism, we can formulate the following model. The lucky electron model,
first proposed by Shockley [15] to model the hot-electron effects in silicon
pn-junctions, is found to describe the impact-ionization and the channel
hot-electron injection effects in MOSFETs very well [16,17,18]. Using the same
formulation, the probability for the channel hot-electrons to attain kinetic
energy in excess of U can be written as

$$P(U) = e^{-U/(qE_x\lambda)} \qquad (1)$$

where q is the electronic charge, U is the kinetic energy of the hot electron, $E_x$
is the channel electric field and $\lambda$ is the mean-free-path of the channel
hot-electrons, mainly due to optical-phonon scattering. The probability $Q_\nu\,d\nu$ of the
emission of a photon with energy in the interval $h\nu$ to $h(\nu+d\nu)$ due to one
electron passing through dx (thickness) containing $N_c$ singly-charged Coulombic
centers per unit volume can be written as (appendix(1))

$$Q_\nu\,d\nu\,dx = \frac{D N_c}{m^{*}\,U\,h\nu}\,d\nu\,dx \qquad (2)$$

where U is the kinetic energy of the electron, $m^{*}$ is the electron conductivity
effective mass and D=2.8x10® q° is a numerical constant (appendix(1)). Based on
equation(1), the number of hot-electrons passing through the channel with
kinetic energy from U to U+dU can be written as

$$R(U)\,dU = \frac{I_{DS}}{q^{2}E_x\lambda}\,e^{-U/(qE_x\lambda)}\,dU \qquad (3)$$

where $I_{DS}$ is the channel current of the MOSFET. The total number of photons
generated in the frequency interval from $\nu$ to $\nu+d\nu$ due to the hot-electrons
having kinetic energy from U to U+dU is

$$\int_0^{L_{eff}} Q_\nu\,d\nu\,R(U)\,dU\,dx = \int_0^{L_{eff}} \frac{D\,d\nu\,N_c\,I_{DS}}{m^{*}\,q^{2}E_x\lambda\,U\,h\nu}\,e^{-U/(qE_x\lambda)}\,dU\,dx \qquad (4)$$

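From equation(4), we can evaluate the total energy radiated in the frequency
range $\nu$ to $\nu+d\nu$ per unit time due to all hot-electrons with initial kinetic energy
above $h\nu$ as

$$W_\nu\,d\nu = h\nu\,d\nu \int_0^{L_{eff}}\!\int_{h\nu}^{\infty} \frac{D N_c I_{DS}}{m^{*}q^{2}E_x\lambda\,h\nu}\,\frac{e^{-U/(qE_x\lambda)}}{U}\,dU\,dx$$

$$= \frac{D\,d\nu\,N_c I_{DS}}{m^{*}q^{2}\lambda} \int_0^{L_{eff}} \frac{1}{E_x}\,E_1\!\left(\frac{h\nu}{qE_x\lambda}\right) dx \qquad (5)$$

$$\approx \frac{D\,d\nu\,N_c I_{DS}}{m^{*}q\,h\nu} \int_0^{L_{eff}} e^{-h\nu/(qE_x\lambda)}\,dx \qquad (5a)$$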

where $E_1$ is the exponential integral. In obtaining equation(5a), we have
assumed that the maximum channel electric field is a few times $10^{5}$ V cm$^{-1}$ and
the hot-electron scattering mean-free-path is on the order of 10 nm [17,18].
Therefore, the product $qE_x\lambda$ is at most a few times 0.1 eV. Since we are only
interested in photons with energy greater than 1.12 eV, the argument of $E_1$ in
equation(5) is usually greater than 5 and we have $E_1(z) \approx e^{-z}/z$ [20]. By a change
of variable and assuming that $dE_x/dx \approx$ constant, we have

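$$W_\nu\,d\nu = \frac{D\,d\nu\,N_c I_{DS}}{m^{*}q\,h\nu} \int_0^{E_m} e^{-h\nu/(qE_x\lambda)} \left(\frac{dE_x}{dx}\right)^{-1} dE_x \qquad (6)$$

$$\approx \frac{D\,d\nu\,N_c I_{DS}\,\lambda}{m^{*}(h\nu)^{2}}\,E_m^{2}\left|\frac{dE_x}{dx}\right|^{-1} e^{-h\nu/(qE_m\lambda)} \qquad (6a)$$

where $E_m$ is the maximum channel electric field (ie. at the drain end). To obtain
equation(6a), we have assumed $|dE_x/dx|$ is constant and hence can be taken
out of the integral. The quantity $E_m\,|dE_x/dx|^{-1}$ has a dimension of length and can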

In the MOSFET, the impact-ionization substrate current can be formulated
as [17,18,20]

$$I_{SUB} = \frac{I_{DS}\,A_i\,E_m^{2}}{B_i\,|dE_x/dx|}\,e^{-B_i/E_m}$$

where $A_i$ and $B_i$ are the pre-exponential and exponential constants in the ionization
coefficient respectively [25]. By eliminating the maximum channel field $E_m$
between equation(10) and equation(12), we obtain

[d£rl  ilxv£b) 

/.  '**  (s-,,  {I0) 

Vo  m  I  dEz  Ids  A,  El 

E'\-dt]L 

Since  Eg  +  b-  1.28e7  and  gf?iA«1.24e V  [13],  we  have  al.  Then  equa- 

qBi\ 


tion(l3)  is  reduced  to 


/„  .=7.4X1 0-3’  (s-i}  (11) 

*'  m  Ai 


The  total  minority-carrier  current  generated  is 

,/v..-=7.4»<10 

•  t  gm  Ai 

$= 1.8\times10^{-5}\,I_{SUB}$ (for NMOS)


It is seen from equation(11) that the total photon generation rate is proportional
to the substrate current $I_{SUB}$ and independent of the channel current $I_{DS}$,
in agreement with the data presented in Section(3). In obtaining equation(12a),
we have used $qB_i\lambda = 1.24$ eV [18], $E_g = 1.12$ eV, $A_i = 1.6\times10^{7}$ m$^{-1}$ [18],
$m^{*} = 0.26\,m_0 = 0.26\times9.11\times10^{-31}$ kg [26], and $N_c$ equal to the drain doping
concentration $\approx 5\times10^{20}$ cm$^{-3}$ $= 5\times10^{26}$ m$^{-3}$. The ratio, $qI_\nu/I_{SUB}$, of $1.8\times10^{-5}$ obtained is in
excellent agreement with our experimental results and those reported in the
literature [2,9,27]. The values of $\lambda$ and $B_i$ are less accurately known for PMOS
(holes). If one uses $qB_i\lambda = 1.47$ eV, $m^{*} = 0.386\,m_0$, and Ai=Z.Zl'K\Qem.~l, the ratio is
found to be $1\times10^{-4}$. There is some evidence that $A_i$ is proportional to $E_m$
[17,25]. That and equation(12) may explain the decrease of the ratio at
high $I_{SUB}/I_{DS}$ such as evident in figure(15).

In the above estimation, we have assumed that $N_c$ is equal to the drain
doping density. That is, we have assumed that the ionized donors at the drain
are causing the scattering of the hot-electrons. This is similar to the mobility
degradation phenomenon found in heavily doped semiconductors.
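
The large-argument approximation $E_1(z) \approx e^{-z}/z$ used in deriving equation(5a) can be checked numerically. The sketch below evaluates $E_1$ by Simpson's rule from the identity $E_1(z) = e^{-z}\int_0^{\infty} e^{-s}/(z+s)\,ds$ and compares it with the approximation at $z = 1.12/0.2$, taking $qE_x\lambda = 0.2$ eV as an illustrative value consistent with "a few times 0.1 eV". The function names and the integration cut-off are choices made here, not part of the report.

```python
import math

def exp_integral_e1(z, upper=60.0, steps=6000):
    """E1(z) for z > 0 via composite Simpson's rule on
    exp(-z) * integral_0^upper exp(-s) / (z + s) ds (steps must be even)."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        s = i * h
        if i == 0 or i == steps:
            weight = 1.0
        elif i % 2 == 1:
            weight = 4.0
        else:
            weight = 2.0
        total += weight * math.exp(-s) / (z + s)
    return math.exp(-z) * total * h / 3.0

def large_argument_approx(z):
    """Leading asymptotic term exp(-z) / z."""
    return math.exp(-z) / z

z = 1.12 / 0.2  # band-gap energy divided by the assumed q*E_x*lambda
ratio = large_argument_approx(z) / exp_integral_e1(z)
print(exp_integral_e1(z), large_argument_approx(z), ratio)
```

At this argument the approximation overestimates $E_1$ by somewhat more than ten percent. This is small compared with the exponential sensitivity of the result to $E_m$, and the error shrinks as the argument grows.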

The photons that can travel a long distance to generate minority carriers
are those that have energy very close to $E_g = h\nu_0$. From equation(7a), the rate of
generation of photons with energy between $\nu_0$ and $\nu_0+\Delta\nu$ (ie. $W_{\nu_0}\Delta\nu/h\nu_0$) is


eqEmX Lo 

Eliminating $E_m$ from equation(15) using equation(12), one obtains


K,.V0  +AW  ~  I. 

< 


This is in excellent agreement with the slopes in figure(5), figure(6), and
figure(13) ($I_{hcoll}$). Equations(1 ) and (1 ) can explain the two different slopes
in  figure(13).  In  figure(18),  we  have  plotted  IVfV§^v  as  a  function  of  ISyB  using 

^-as  the  parameter  (equation(l  )  and  (1  )).  When  the  ratio  ^-decreases,  the 

ve  v» 

relationship between the photon generation rate (in the range $\nu_0$ to $\nu_0+\Delta\nu$) and
$I_{SUB}$ is sub-linear. This also agrees with our experimental results shown in
Section(3) (eg. figure(13)).
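
The sub-linear slope follows from comparing two exponentials. The sketch below assumes that the photon rate at energy $h\nu$ scales as $I_{DS}\,e^{-h\nu/(q\lambda E_m)}$ and that $I_{SUB}/I_{DS}$ scales as $e^{-B_i/E_m}$, ignoring the slowly varying prefactors. Eliminating $E_m$ then gives a photon rate proportional to $I_{DS}(I_{SUB}/I_{DS})^{\gamma}$ with $\gamma = h\nu/(qB_i\lambda)$. The function names are hypothetical; the 1.24 eV and 1.12 eV inputs are the values quoted in the text.

```python
def substrate_current_exponent(photon_energy_ev, q_bi_lambda_ev=1.24):
    """Power-law exponent gamma = h*nu / (q * B_i * lambda)."""
    return photon_energy_ev / q_bi_lambda_ev

def relative_photon_rate(i_sub, i_ds, photon_energy_ev, q_bi_lambda_ev=1.24):
    """Photon rate up to a constant factor, prefactors neglected."""
    gamma = substrate_current_exponent(photon_energy_ev, q_bi_lambda_ev)
    return i_ds * (i_sub / i_ds) ** gamma

gamma_gap = substrate_current_exponent(1.12)  # near-band-gap photons
print(gamma_gap)
```

Under these assumptions, near-band-gap photons give an exponent of about 0.9, at the upper end of the measured range of 0.7 to 0.9. Photons at 1.24 eV would give an exponent of exactly one, and with it a rate independent of $I_{DS}$.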

In order to calculate the number of minority-carriers generated per unit
time as a function of location in the wafer, we have to solve the three-dimensional
current continuity equation subject to the photon generation rate
described above and the geometries of the MOSFET and the collector pair, and
we also have to know the accurate dependence of the absorption coefficient on
photon energy. Due to the complexity of this part of the problem, this will not
be treated here.

5.  Conclusion 

The phenomenon of minority-carrier generation in the substrate of VLSI
chips is studied. Based on experimental results on NMOS and CMOS wafers, we
concluded that the minority-carrier generation mechanisms are (i) injection of
minority-carriers from the source-to-substrate junction when the substrate
current is excessively high, and (ii) photo-carrier generation where the photons
originate from the drain high-field region of the MOSFET. Specifically,
secondary-impact-ionization is found to play no part in the creation of substrate
minority-carriers.

New evidence rules out the hot-hole interband transitions and tends to contradict
the electron-hole recombination as the mechanism of light generation.
No contradiction is found with the bremsstrahlung explanation. A theoretical
formulation of the hot-electron induced photon generation phenomenon is
presented. The theory is based on the bremsstrahlung of the hot-electrons.
Using this approach and the lucky electron concept, the theoretical quantum
efficiency of photon generation is obtained. Assuming that one photon with
energy above the Si band-gap can generate only one electron-hole pair, we
obtained the theoretical minority-carrier generation rate. The theoretical values
for the minority-carrier generation rate, $1.8\times10^{-5}$ for each impact-ionization
event in NMOS and $1\times10^{-4}$ for PMOS, are in good agreement with the experimentally
obtained values of $\approx 2\times10^{-5}$ and $1\times10^{-4}$. The required Coulombic
scattering centers were found to be the ionized drain impurity dopants. The
generation rate of near bandgap photons increases sublinearly with $I_{SUB}$
for NMOS.


The  presence  of  hot-carriers  in  N-channel  MOSFETs  is  widely  recognized 
[28,29,30].  The  observation  of  the  substrate  current  and  the  gate  current  are 
two  examples  of  hot-electron  effects  in  MOSFETs.  Therefore,  we  should  expect 
the  light  emission  process,  which  is  just  another  hot-carrier  effect,  to  be  present 
in MOSFETs. It should be noted that the light-emission process is not a direct
consequence of impact-ionization; rather, it is merely due to the presence of
hot-carriers. Since hot-carriers will also cause impact-ionization and lead to
substrate current, we would expect a correlation between the substrate current
and the photon-generated minority-carrier current to exist.

Previous studies on light emission from silicon pn-junctions under
avalanche breakdown showed that the spectrum of radiation may be approximated
as [12] $W_\nu \sim e^{-h\nu/kT_e}$, where $T_e$ is the electron temperature of the
hot-electrons and k is Boltzmann's constant. To apply this to the MOSFET, we
need to know the relationship between the electron temperature $T_e$ and the
channel electric field as well as the electric field (this relationship is unfortunately
not accurately known). Using the lucky electron concept, we have
derived $W_\nu \sim e^{-h\nu/(q\lambda E_m)}$, where $\lambda$ is the hot-electron mean-free-path and $E_m$ is the
peak channel electric field. In essence, we have shown that $T_e = \frac{q}{k}E_m\lambda$ [31].
Furthermore, a correlation between the photon generation rate, and hence the
minority-carrier generation rate, and the substrate current is theoretically
obtained as stated above.
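
Matching the two exponential spectra quoted above gives $kT_e = q\lambda E_m$, which can be evaluated for representative numbers. The sketch below uses a peak field of $2\times10^{5}$ V/cm and a mean-free-path of 10 nm, illustrative values chosen to match the orders of magnitude stated in Section 4. The function name is hypothetical.

```python
K_B_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K

def effective_electron_temperature(e_m_v_per_cm, mean_free_path_nm):
    """Effective temperature (K) from k*T_e = q*lambda*E_m."""
    # Field (V/cm) times path (nm -> cm) gives the energy gain in eV.
    energy_ev = e_m_v_per_cm * mean_free_path_nm * 1e-7
    return energy_ev / K_B_EV_PER_K

t_e = effective_electron_temperature(2e5, 10.0)
print(t_e)  # kelvin
```

For these inputs the characteristic energy is 0.2 eV, corresponding to an effective temperature of a little over two thousand kelvin.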

From this model, we concluded that a possible way to reduce the photo-carrier
generation is the use of a lightly-doped source-drain structure [32] or
other improved structures in the design of VLSI MOSFETs. On the other hand, the
common technique of employing guard-rings or dummy collectors is, as may be
expected, less effective in combating the long-range substrate leakage current
described here. The photo-carrier induced leakage current decays with
distance roughly according to the diffusion length at close range. At long range,
it can be fitted to either a square dependence on distance or an exponential
dependence with an effective decay length of about 780 µm.

Appendix(1) : Formulation of the Basic Equations

In this appendix, we formulate the basic equations used in the model
of bremsstrahlung that is presented in section(4) (i.e., equations(3) and (4)).

Based on classical electromagnetic theory, when an electron collides
with a singly charged Coulombic center, the electron will move on a hyperbolic
orbit subject to pure Coulombic interaction and will give off energy according
to the following equation [33]


$$\frac{dW}{dt} = \frac{q^2\sqrt{\varepsilon_{Si}}}{6\pi\varepsilon_0 c^3}
\left(\frac{q^2}{4\pi\varepsilon_0\varepsilon_{eff}\,m\,r^2}\right)^2 \qquad (A1.1)$$

where $c$ is the speed of light in vacuum, $\varepsilon_0$ is the permittivity
of free space, $\varepsilon_{Si}$ is the relative dielectric constant of Si,
$\varepsilon_{eff}$ is the effective dielectric constant in relation to the
hot electrons, $m$ is the electron mass, $q$ is the electronic charge,
and $r$ is the distance between the electron and the Coulombic center. Kramer
where $G(\xi)$ is a rational polynomial [29]. In figure(A2.1), we have plotted
the function $1 - G(\xi)$. This can be approximated by an exponential fit
with a = 0.30206 and b = 0.1585 eV (dotted curve). In figure(A2.2), we have
plotted the quantity inside the [ ] in equation(A2.1). A very good fit to this

Abstract: Photons are generated by forward biasing a silicon p-n
junction at $10^{-5}$ to $10^{-4}$ quantum efficiency through radiative
recombination. At large distances from the forward-biased junction,
leakage currents of magnitudes significant for some VLSI circuits can
appear due to the substrate minority carriers generated by the photons.
The effective decay length of the measured leakage current is
about several hundred to one thousand micrometers. The effects of
forward biasing an input node or a parasitic lateral bipolar transistor
are, therefore, longer ranged than commonly assumed.


I.  INTRODUCTION 

LIGHT EMISSION from silicon has long been observed
in both forward-biased and reverse-biased avalanching
silicon p-n junctions [1], but its effects in modern IC's have
been little discussed. For forward-biased p-n junctions the
mechanism for photon generation is radiative recombination,
and the light emission has been used as a monitor of the uniformity
of current in p-n-p-n thyristors [2]. In reverse-biased
avalanching junctions, the mechanism for photon generation
is direct transitions between different valence bands [3]
or hot-electron/hole recombinations [1]. Fig. 1 shows the
spectra for both a forward-biased and a reverse-biased avalanching
junction [1]. The spectrum of the forward-biased
case has a peak at 1.1 eV and a sharp cutoff at both high and
low energies. For a reverse-biased avalanching junction, the
spectrum is broad and extends to photon energies greater
than 3 eV.

It was recently reported that photon generation induced
by hot carriers does exist in silicon MOSFET's when operating
in the saturation region [4]-[6]. The same phenomenon was
also observed in CMOS VLSI devices [6]. The mechanism of
photon generation and hence the spectrum are similar or
identical to the case of a reverse-biased avalanching p-n junction
[3], [6]. As device dimensions are scaled down, the
effect of hot-electron-generated photons in silicon integrated
circuits has become important since they can generate minority
carriers in the substrate and discharge sensitive nodes.
Already DRAM refresh time degradation due to this phenomenon
is well known [7], and upset of SRAM or logic circuits
is possible [8]. This paper describes the other occurrence of
light emission, that from forward-biased p-n junctions [6],
with the intention of highlighting its ability to discharge
sensitive nodes in IC circuits.

Fig. 1. The emission spectra for forward-biased and reverse-biased
avalanching silicon p-n junctions. Solid line: forward-biased case;
dashed line: avalanche breakdown case [1].

II. EXPERIMENT

Our simple test chip consists of many isolated junctions.
The junctions used are actually the sources or drains of enhancement-mode
MOSFET's. One junction acts as a collector of minority carriers and is
reverse-biased by 1 V. The other junctions, which are separated from the
collector by varying distances, act as injectors of minority carriers and
are forward biased by a current source. Measurements are done for both
n+ diffusions in a p-substrate and p+ diffusions in an n-substrate.
The collecting junctions are 49 μm × 108 μm and 0.3 μm deep in the
p-substrate, and 25 μm × 14 μm and 0.6 μm deep in the n-substrate.
The substrate concentrations used are $6.6 \times 10^{15}$ cm$^{-3}$ for the
p-substrate (boron doped) and $2.7 \times 10^{17}$ cm$^{-3}$ for the
n-substrate (phosphorus doped).
There was no particular reason for choosing different substrate
concentrations and no intention to make the concentrations
differ so much. We chose these two samples only because
they were available during the measurement. Fig. 2(a) and
2(b) show the collection current as a function of the distance
between the injecting junction and the collecting junction
for n+p and p+n diodes, respectively. The reverse saturation
currents at 1-V reverse bias are 0.5 pA for the n+p diode
and 0.1 pA for the p+n diode. Each has been subtracted from
the measured collection current.

III. RESULTS AND DISCUSSION

From Fig. 2, it can be seen that for small distances the
collection current decreases rapidly, but for large distances
the decrease in the collection current is much slower. The
rapid decrease at small distances is explainable by diffusion
of minority carriers. The approximate decay lengths for the
minority carriers are 2* and 12 μm for the p-substrate and the
n-substrate, respectively. These are in good agreement with
reverse recovery-time measurement, which gives a lifetime of
230 ns for the p-type substrate and a lifetime of 100 ns for
the n-type substrate. The slow decrease in the collection current
at large distances is not explainable by diffusion theory,
because carriers cannot decay with one diffusion length for
some distance and then with another much longer diffusion
length. The slow decrease of the collection current is believed
to be the result of photon generation of minority carriers.
The photons themselves are generated through a radiative
band-to-band recombination process as described in Section I.
They have energies near that of the band gap (see Fig. 1)
and thus have a long absorption length. When the photons
are absorbed, electron-hole pairs are created in the substrate
and the electrons or holes can then be collected by a reverse-biased
junction. In Fig. 2, beyond 1000 μm (40 mil) the
collection currents can be fitted with effective decay lengths
of ~700 μm (p-substrate) and ~1100 μm (n-substrate).

Fig. 2. (a) The collection current versus distance with the injecting
current as a parameter for the $6.6 \times 10^{15}$ cm$^{-3}$ p-type substrate.
(b) For the $2.7 \times 10^{17}$ cm$^{-3}$ n-type substrate.


The number of photons generated per second by a forward-biased
junction is given by

$$N_p = \eta\,\frac{I_f}{q} \qquad (1)$$

where $\eta$ is the quantum efficiency, which is a function of the
minority-carrier concentration, $I_f$ is the injected current,
and $q$ is the charge of an electron. Assuming there is no optical
reflection from the top and bottom surfaces of the wafer, the
number of electron-hole pairs generated per second per volume
by these photons is

$$G(d) = \frac{N_p\,\alpha}{4\pi d^2}\exp(-\alpha d). \qquad (2)$$


Here, $d$ is the distance between the injector and the point
where the electron-hole pairs are generated, and $\alpha$ is the absorption
coefficient for these photons in silicon. The collection
current is related to this generation rate by

$$I_c(d) = q\,V_c\,G(d) \qquad (3a)$$

$$V_c \approx \frac{2\pi L_d\,(W/2 + L_d)(L/2 + L_d)}{3}. \qquad (3b)$$

$V_c$ is a "collection volume" dependent on collector size and
carrier diffusion length but is assumed to be independent of
$d$. The formula used for $V_c$ is the volume of a half-ellipsoid with
semi-axes $L_d$, $W/2 + L_d$, and $L/2 + L_d$. $L_d$ is the diffusion length
of minority carriers and $W$ and $L$ are the width and length
of the collecting junction, respectively. Equations (2) and (3a)
are approximate formulas which are good when $d$ is much
larger than the injector and collector dimensions. To extract
the $\alpha$ in (2) we replot the data of the p-substrate case in Fig. 3 as

$$R \equiv \frac{d^2 I_c}{I_f} = \frac{\eta V_c\,\alpha}{4\pi}\exp(-\alpha d) \qquad (4)$$

versus the distance from the injecting junction. The figure is
plotted for distances larger than 700 μm where the effects
of photons start to become important. At large distances
$R$ stays almost constant. This indicates that the collection
current originates from the absorption of photons which
have a long absorption length. By extrapolating $R$ back to $d$ =
0 μm, the product of the absorption coefficient $\alpha$ and the
quantum efficiency $\eta$ can be estimated with (4). Fig. 4 shows
the measured $\eta$ ($\eta_m$), with $\alpha$ used as an adjustable parameter
to fit the theoretical quantum efficiency $\eta_t$, versus the injected
current. Theoretically, $\eta_t$ is equal to the ratio of the minority-carrier
lifetime $\tau$ to the radiative lifetime $\tau_r$ [9]

$$\eta_t = \frac{\tau}{\tau_r} = \tau B\,(n' + N_{sub}). \qquad (5)$$

$n'$ is the excess minority-carrier concentration, $N_{sub}$ is the
substrate concentration, and the constant $B$ is $2 \times 10^{-15}$ cm$^3$ s$^{-1}$
[10]. In (4), with $V_c = 7.1 \times 10^5$ μm$^3$, $\tau$ = 450 ns for the
n+p junctions and $V_c = 0.287 \times 10^5$ μm$^3$, $\tau$ = 260 ns for the
p+n junctions (from (3b)), $\eta_m$ can have good agreement with
$\eta_t$ by setting $\alpha$ = 2.5 cm$^{-1}$ for the p-substrate and $\alpha$ = 4.5
cm$^{-1}$ for the n-substrate as shown in Fig. 4. $\alpha$ = 2.5 cm$^{-1}$
is in reasonable agreement with the slopes at large distances
in Fig. 3 and within the expected range based on the results
by Vavilov [11]. $\alpha$ = 2.5 cm$^{-1}$ is also close to the reported
data by Spitzer and Fan [12]. This result supports our argument
about photon generation in forward-biased silicon p-n
junctions.

Fig. 3. $R = d^2 I_c/I_f$ versus distance with the injection current as a
parameter for the $6.6 \times 10^{15}$ cm$^{-3}$ p-type substrate.

Fig. 4. Theoretical quantum efficiency and the quantum efficiency
calculated from (4) versus the injected current: (×) for the p-substrate,
(○) for the n-substrate, solid line for measurement,
dashed line for theory.
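
The chain from injected current to photon-induced collection current can be
sketched numerically as follows. This is only an illustration: it takes the
photon rate as $\eta I_f/q$, the generation rate as
$N_p\alpha\exp(-\alpha d)/(4\pi d^2)$, and the collection volume as a
half-ellipsoid, with $V_c = 7.1\times10^5$ μm$^3$ and $\alpha$ = 2.5 cm$^{-1}$
as quoted for the p-substrate. The quantum efficiency ($10^{-4}$), injected
current (1 mA), distance (1000 μm), and diffusion length (47 μm, picked so that
the half-ellipsoid volume lands near the quoted $V_c$) are assumptions of ours.

```python
import math

Q = 1.602e-19  # electron charge, C


def photon_rate(eta, i_f):
    """Photons generated per second: N_p = eta * I_f / q."""
    return eta * i_f / Q


def generation_rate(n_p, alpha, d):
    """Electron-hole pairs per second per cm^3 at distance d (cm)."""
    return n_p * alpha * math.exp(-alpha * d) / (4.0 * math.pi * d ** 2)


def collection_volume(l_d, w, l):
    """Half-ellipsoid volume with semi-axes l_d, w/2 + l_d, l/2 + l_d."""
    return 2.0 * math.pi * l_d * (w / 2.0 + l_d) * (l / 2.0 + l_d) / 3.0


def collection_current(eta, i_f, alpha, d, v_c):
    """Collection current I_c = q * V_c * G(d), in amperes (V_c in cm^3)."""
    return Q * v_c * generation_rate(photon_rate(eta, i_f), alpha, d)


UM3_TO_CM3 = 1e-12

# Assumed diffusion length of 47 um with the 49 um x 108 um collector.
v_c_um3 = collection_volume(47.0, 49.0, 108.0)

# Assumed eta = 1e-4, I_f = 1 mA, d = 0.1 cm; alpha and V_c as quoted.
i_c = collection_current(1e-4, 1e-3, 2.5, 0.1, 7.1e5 * UM3_TO_CM3)
print(f"V_c = {v_c_um3:.3g} um^3, I_c = {i_c:.3g} A")
```

With these assumed inputs the collected current comes out on the order of a
picoampere at 1000 μm, comparable to the reverse saturation currents quoted
for the collecting junctions.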

IV.  CONCLUSION 

A forward-biased silicon p-n junction not only injects
minority carriers into the silicon substrate but also generates
photons through radiative recombination. At small distances
from the injecting junction, the measured collection current
is dominated by the diffusion transport of carriers. At large
distances from the injector, the collection current is mainly
due to the substrate minority carriers that are generated by
the absorption of photons. The collection current appears
to have an effective decay length of about several hundred
to one thousand micrometers when the effect of photons
dominates. During the radiative recombination process, it
takes approximately $10^4$ to $10^5$ electron-hole recombinations
to generate one photon. The absorption coefficients are
2.5 cm$^{-1}$ (p-substrate) and 4.5 cm$^{-1}$ (n-substrate) in the two
samples studied. The current generated by these photons
can have a magnitude much larger than the reverse saturation
current of the collecting junction.

The implications of photon generation in the silicon substrate
may be illustrated with dynamic RAM circuits. In the
undesirable case that any of the junctions in the peripheral
circuit is forward biased by a substrate-current-induced voltage
drop [13] or by positive voltage glitches on the input and
output lines, errors can be induced deep inside the array
because of the long decay length of the collection current.
The guard ring and the epi-substrate techniques may not be
as effective as expected in protecting nodes far from the I/O
circuits.

ABSTRACT

The lucky electron concept is successfully applied to the
modelling of channel hot electron injection in n-channel MOSFETs,
although the result can be interpreted in terms of electron temperature
as well. This results in a relatively simple expression that
can quantitatively predict channel hot electron injection current in
MOSFETs. The model is compared with measurements on a series
of n-channel MOSFETs and good agreement is achieved. In the process,
new values for many physical parameters such as the hot-electron
mean-free-path are determined. Of perhaps even greater
practical significance is the qualitative correlation between the
gate current and the substrate current that this model suggests.
The dominant hot-electron scattering mechanism is due to

1. Introduction

Recent extensive studies on short-channel MOSFETs have made much progress
in the development as well as the understanding of the limitations of VLSI
circuits [1,2]. One important aspect of the physics of the short-channel MOSFETs
is the injection of hot electrons from the channel into the gate [3,4,5]. Channel
hot-electron injection (CHEI) into the gate can result in the degradation of
device performance due to the trapping of electrons in the gate oxide and the
generation of interface traps. The phenomenon of channel hot-electron injection is
also widely used as the programming mechanism in EPROMs. Therefore a simple
quantitative model of the CHEI effect in MOSFETs would be useful for the
understanding and design of future VLSI devices. At high enough drain voltages, CHEI
can be measured directly as gate currents [3]. Indirect measurements down to
very small currents utilizing floating gate MOSFETs have also been made [5]
although results from these measurements are more difficult to interpret due to
the complex device structures [6].

There are two separable parts in the task of modelling CHEI. The first part
is to find the electric field, particularly the maximum field, in the channel.
The more reliable means of finding the field is 2-D or 3-D computer device
simulation [7,8,9], although good accuracy has also been reported for an
analytical field model based on pseudo-two-dimensional considerations [10].
The second part is to model CHEI in terms of the channel electric field. The
second part is the subject of the present paper. This is most often done using
the concept of electron temperature. However, no reliable theory or experiment
has yet been developed relating the field and the electron temperature [11].
Consequently, all CHEI models have been either empirical in nature [3] or
computationally complex and untested [12].

In this paper, we expand on and make a more thorough presentation of a
physical model for CHEI in MOSFETs based on the lucky-electron concept [13].
Direct measurements of channel hot-electron injection as MOSFET gate current
are used to evaluate the model. Studies are made on polysilicon-gate MOSFETs
with arsenic doped source/drain regions. The model to be presented assumes
that the maximum channel field in the direction of the channel current is known.
Experimentally, in this paper, the maximum field is deduced from the measured
substrate current. A by-product of this study is the reaffirmation, by theory
and experiments, of a correlation between the gate current and the substrate
current (at least when the gate voltage is higher than the drain voltage) [14].

2.  Model 

The streaming or lucky electron approach of modelling the hot-electron
distribution was originated by Shockley [15]. Later Verwey et al. [16] used this
approach in their study on substrate hot electron injection in MOSFETs, which
was later refined and verified by Ning et al. [17]. Hu [13] modified the substrate
lucky electron injection model and applied it to CHEI in MOSFETs.

Conceptually, the lucky electron model of CHEI can be described as follows.
In order for channel hot electrons to reach the gate, the hot electrons
must gain sufficient kinetic energy from the channel field to surmount the
potential barrier at the silicon/silicon-dioxide interface. The momentum of
these hot electrons must then be re-directed elastically toward the
silicon/silicon-dioxide interface. To quantify the probability that these
electrons could eventually be collected by the gate, several types of inelastic
scatterings have to be considered (figure(1)). From point A to B, a channel
electron gains energy from the channel field and becomes "hot". At B,
re-direction of the hot electron takes place. From point B to C (C is situated
at the interface), the hot electron must not suffer any energy-robbing
collision so that it will retain the energy required to surmount the
silicon/silicon-dioxide potential barrier. The hot electron must also suffer
no collision in the oxide image-potential well located between C and D. Once the
hot electron arrives at location D, it will be swept toward the gate electrode by
the aiding field. In the following, we shall analyze each of these processes
mathematically and separately.

2.1. Probability of Acquiring Sufficient Normal Momentum

A schematic illustration of the injected hot electron in the potential-distance
space is shown in figure(2). In order for the hot electron to surmount
the silicon/silicon-dioxide potential barrier $\phi_b$ (in volts), its kinetic
energy must be greater than $q\phi_b$. To acquire the kinetic energy $q\phi_b$
the hot electron will have to travel a distance $d$ where $d = \phi_b/E_x$ if we
assume the electric field $E_x$ to be constant. The probability of a channel
electron to travel a distance $d$ or more without suffering any collision can be
written as [3] $e^{-d/\lambda}$, where $\lambda$ is the scattering
mean-free-path of the hot electron. The interpretation of $\lambda$ will be
discussed later. ($\lambda$ depends on the optical-phonon scattering
mean-free-path and the impact-ionization mean-free-path.) Hence, we can write
$e^{-\phi_b/E_x\lambda}$ as the probability that an electron will acquire
kinetic energy greater than the silicon/silicon-dioxide potential barrier.
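
To give a feel for how steeply this probability depends on the channel field,
the following sketch evaluates $d = \phi_b/E_x$ and $e^{-d/\lambda}$ for a few
fields. The barrier height (3.2 V) and mean-free-path (9 nm) are assumed,
illustrative values rather than numbers taken from this section.

```python
import math


def travel_distance(phi_b, e_x):
    """Distance d = phi_b / E_x needed to gain the barrier energy (cm)."""
    return phi_b / e_x


def energy_gain_probability(phi_b, e_x, mfp_cm):
    """Probability exp(-d/lambda) of travelling d without a collision."""
    return math.exp(-travel_distance(phi_b, e_x) / mfp_cm)


PHI_B = 3.2      # assumed barrier height, V
MFP_CM = 9e-7    # assumed hot-electron mean-free-path, cm

for field in (1e5, 2e5, 4e5):  # channel field in V/cm
    p = energy_gain_probability(PHI_B, field, MFP_CM)
    print(f"E_x = {field:.0e} V/cm -> probability = {p:.2e}")
```

Doubling the field roughly squares-roots the exponent, so the probability
changes by many orders of magnitude over a modest range of $E_x$.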

If an electron is to be emitted, its momentum must be re-directed toward
the silicon/silicon-dioxide interface by an elastic scattering and must have a
sufficiently large momentum component perpendicular to the interface. An
electron that possesses exactly the energy $q\phi_b$ will be emitted only if its
momentum is directed into an infinitesimally small solid angle normal to the
Si/SiO2 interface. Assuming isotropic re-direction scatterings, an electron
possessing energy $\phi = \phi_b + \Delta\phi$ would, due to geometrical
considerations only, have the probability of surmounting the barrier [18]

$$\frac{1}{2}\left[1 - \sqrt{\frac{\phi_b}{\phi_b + \Delta\phi}}\right]
\approx \frac{\Delta\phi}{4\phi_b} \qquad (1)$$

where $\Delta\phi/4\phi_b$ is correct for $\Delta\phi \ll \phi_b$. The
probability of an electron having kinetic energy between $\phi_b + \Delta\phi$
and $\phi_b + \Delta\phi + d(\Delta\phi)$ is

$$\frac{1}{E_x\lambda}\,e^{-(\phi_b + \Delta\phi)/E_x\lambda}\,d(\Delta\phi). \qquad (2)$$

The probability of an electron having enough normal momentum to surmount
the silicon/silicon-dioxide potential barrier can be evaluated by integrating the
product of equations(1) and (2) over all $\Delta\phi$. Then the probability of
an electron acquiring the required kinetic energy and retaining the appropriate
momentum after re-direction can be expressed as

$$P_{\phi_b} = \int_0^{\infty} \frac{\Delta\phi}{4\phi_b}\,
e^{-(\phi_b + \Delta\phi)/E_x\lambda}\,\frac{d(\Delta\phi)}{E_x\lambda}
= 0.25\,\frac{E_x\lambda}{\phi_b}\,e^{-\phi_b/E_x\lambda}. \qquad (3)$$
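
The closed form $0.25\,(E_x\lambda/\phi_b)\,e^{-\phi_b/E_x\lambda}$ can be
checked numerically. The sketch below integrates the isotropic angular factor
$\tfrac12[1-\sqrt{\phi_b/(\phi_b+\Delta\phi)}]$ against the exponential energy
distribution with a simple trapezoidal rule and compares the result with the
closed form, which uses the small-$\Delta\phi$ approximation
$\Delta\phi/4\phi_b$. The barrier height, field, and mean-free-path are assumed
illustrative values.

```python
import math


def angular_factor(delta_phi, phi_b):
    """Isotropic-redirection probability for energy phi_b + delta_phi."""
    return 0.5 * (1.0 - math.sqrt(phi_b / (phi_b + delta_phi)))


def p_phib_numeric(phi_b, e_x, mfp_cm, steps=20000):
    """Trapezoidal integral of angular factor times energy distribution."""
    scale = e_x * mfp_cm            # E_x * lambda, in volts
    upper = 40.0 * scale            # integrand is negligible beyond this
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        dphi = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        density = math.exp(-(phi_b + dphi) / scale) / scale
        total += weight * angular_factor(dphi, phi_b) * density
    return total * h


def p_phib_closed(phi_b, e_x, mfp_cm):
    """Closed form 0.25 * (E_x*lambda/phi_b) * exp(-phi_b/(E_x*lambda))."""
    scale = e_x * mfp_cm
    return 0.25 * scale / phi_b * math.exp(-phi_b / scale)


# Assumed values: phi_b = 3.2 V, E_x = 2e5 V/cm, lambda = 9 nm.
numeric = p_phib_numeric(3.2, 2e5, 9e-7)
closed = p_phib_closed(3.2, 2e5, 9e-7)
print(f"numeric/closed = {numeric / closed:.3f}")
```

The two agree to within several percent when $E_x\lambda \ll \phi_b$, with the
closed form slightly overestimating because the linearized angular factor
overstates the exact one at larger $\Delta\phi$.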


2.2. Probability of Collision-Free Travel to the Barrier Peak

We now evaluate the probability that a hot electron travels to the Si-SiO2
interface without suffering any collision, $P_1$, after undergoing a re-directing
collision at varying depths below the interface. Here, $P_1$ is a scattering
probability factor weighted by the electron concentration in the inversion
layer. If we have $n(y)$ as the electron concentration at depth $y$ and position
$x$ in the channel (see figure(3)), then $P_1$ can be expressed as

$$P_1 = \frac{\int_{y=0}^{\infty} n(y)\,e^{-y/\lambda}\,dy}
{\int_{y=0}^{\infty} n(y)\,dy} \qquad (4)$$

The exponential term in equation(4) is the probability of not suffering any
energy-robbing collisions as described previously. In order to find $n(y)$, we
need to find the potential $\psi(y)$ by solving the Poisson equation,

$$\frac{\partial E_x}{\partial x} + \frac{\partial E_y}{\partial y}
= \frac{\rho}{\varepsilon_{Si}} \qquad (5)$$

By assuming strong inversion and the gradual channel approximation
(i.e., $\partial E_y/\partial y \gg \partial E_x/\partial x$), equation(5) can be
solved (see Appendix(1)) and we have

$$P_1 = 1 - a\,e^{a}E_1(a) \qquad (6)$$

where

$$a = \frac{6kT}{q\lambda E_{ox}} = \frac{5.172\times10^{-4}\,T}{E_{ox}\lambda},
\qquad E_{ox} \approx \frac{V_{gs} - V_{ds}}{t_{ox}},$$

$T$ is the absolute temperature, and $E_1$ is the exponential integral. In short
channel devices, we need to include the $\partial E_x/\partial x$ term in
equation(5) when $V_{ds}$ is sufficiently high. Appendix(2) shows how this can
be done approximately.
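
The expression $1 - a\,e^{a}E_1(a)$ is easy to evaluate once the exponential
integral is available. The sketch below uses the standard power series for
$E_1$, which is adequate for moderate arguments, together with
$a = 5.172\times10^{-4}\,T/(E_{ox}\lambda)$. The temperature (300 K), oxide
field ($10^6$ V/cm), and mean-free-path (9 nm) in the example are assumptions
for illustration only.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(a, terms=60):
    """Series E1(a) = -gamma - ln(a) - sum_{k>=1} (-a)^k / (k * k!).

    Intended for roughly 0 < a <= 5; larger arguments need another method.
    """
    if a <= 0.0:
        raise ValueError("a must be positive")
    total = -EULER_GAMMA - math.log(a)
    term = 1.0
    for k in range(1, terms + 1):
        term *= -a / k          # term now equals (-a)^k / k!
        total -= term / k
    return total


def a_parameter(temp_k, e_ox_v_per_cm, mfp_cm):
    """a = 5.172e-4 * T / (E_ox * lambda), with E_ox in V/cm, lambda in cm."""
    return 5.172e-4 * temp_k / (e_ox_v_per_cm * mfp_cm)


def p1(a):
    """Collision-free probability P1 = 1 - a * exp(a) * E1(a)."""
    return 1.0 - a * math.exp(a) * exp_integral_e1(a)


a_example = a_parameter(300.0, 1e6, 9e-7)   # assumed operating point
print(f"a = {a_example:.3f}, P1 = {p1(a_example):.3f}")
```

A larger oxide field gives a smaller $a$ and pushes $P_1$ toward unity,
reflecting an inversion layer that is confined closer to the interface.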


The last probability factor we have to consider is the scattering in the oxide
image-potential well. Here, the probability $P_2$ can be written as [18]

$$P_2 = e^{-y_0/\lambda_{ox}} \qquad (7)$$

where

$$y_0 = \sqrt{\frac{q}{16\pi\varepsilon_{ox}E_{ox}}}$$

and $\lambda_{ox}$ = 3.2 nm [19]. Combining the constants, we arrive at the
expression for $P_2$ as

$$P_2 = e^{-300/\sqrt{E_{ox}}} \qquad (8)$$

where $E_{ox}$ is in V/cm.
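
The numerical constant in the combined expression can be recovered from the
image-potential distance and the 3.2 nm oxide mean-free-path. The sketch below
does this arithmetic, assuming a relative oxide permittivity of 3.9 (our
assumption; the value is not stated in this section) and CGS-style lengths in
centimeters.

```python
import math

Q = 1.602e-19          # electron charge, C
EPS0 = 8.854e-14       # permittivity of free space, F/cm
EPS_OX_REL = 3.9       # assumed relative permittivity of SiO2
LAMBDA_OX_CM = 3.2e-7  # oxide mean-free-path, 3.2 nm


def image_well_distance(e_ox_v_per_cm):
    """Barrier-peak distance sqrt(q / (16*pi*eps_ox*E_ox)), in cm."""
    eps_ox = EPS_OX_REL * EPS0
    return math.sqrt(Q / (16.0 * math.pi * eps_ox * e_ox_v_per_cm))


def p2(e_ox_v_per_cm):
    """Probability of crossing the image-potential well without scattering."""
    return math.exp(-image_well_distance(e_ox_v_per_cm) / LAMBDA_OX_CM)


# Coefficient multiplying 1/sqrt(E_ox) in the exponent.
coefficient = image_well_distance(1.0) / LAMBDA_OX_CM
print(f"exponent coefficient = {coefficient:.1f}")
print(f"P2 at 1e6 V/cm = {p2(1e6):.3f}")
```

The coefficient comes out very close to 300 when $E_{ox}$ is expressed in
V/cm, which is consistent with the simplified exponent.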

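We shall define the product of $P_1$ and $P_2$ as $P$, which is essentially
only a function of $E_{ox}$. $P(E_{ox})$ can be approximated by (ref.[14] and
Appendix(2))

$$P(E_{ox}) \approx \left[\frac{5.65\times10^{-6}\,E_{ox}}
{\left(1 + \dfrac{E_{ox}}{1.45\times10^{5}}\right)\left(1 + \ldots\right)}
+ 2.5\times10^{-2}\right] e^{-300/\sqrt{E_{ox}}} \qquad (9)$$

for $E_{ox} \geq 0$ and

$$P(E_{ox}) \approx 2.5\times10^{-2}\,e^{-\ldots/\lambda_{ox}} \qquad (9a)$$

for $E_{ox} < 0$.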


2.3. Evaluation of the CHEI Gate Current

We can now express the gate current $I_{gate}$ in terms of the probabilities
described in the previous sections, as

$$I_{gate} = I_{ds}\int_0^{L} P_{\phi_b}\,P(E_{ox})\,\frac{dx}{\lambda_r} \qquad (10)$$

where $L$ is the length of the channel (i.e., from the source to the drain
metallurgical junction or even beyond) and $\lambda_r$ is the re-direction
scattering mean free path. The factor $dx/\lambda_r$ can be interpreted as the
probability of re-direction over $dx$. Hence, the integral in equation(10) gives
the total probability of CHEI. The parameter $\lambda_r$ will be discussed later.

Since the probability $P_{\phi_b}$ depends exponentially on $E_x$ and $E_x$ also
varies exponentially with $x$ [10,20], the integrand in equation(10) is a
sharply peaking function. Therefore, an approximate expression for the integral
is

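$$\int_0^{L} P_{\phi_b}\,P(E_{ox})\,\frac{dx}{\lambda_r} \approx
\left[P_{\phi_b}\,P(E_{ox})\right]_{max}\frac{\Delta L}{\lambda_r} \qquad (11)$$

where $\Delta L$ is the length of the region of significant CHEI. In the case of
$V_g > V_d$, $[P_{\phi_b}P(E_{ox})]_{max}$ occurs at the drain where $P_{\phi_b}$
or $E_x$ is maximum. $\Delta L$ may be approximated by
$P_{\phi_b}/(dP_{\phi_b}/dx)$ evaluated at $x = L$. (Note that $dP(E_{ox})/dx$
is usually much smaller than $dP_{\phi_b}/dx$.)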

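Equation (10) is approximately

$$I_{gate} \approx 0.25\,\frac{I_{ds}}{\lambda_r}
\left(\frac{\lambda E_m}{\phi_b}\right)^2
\frac{E_m\,P(E_{ox}|_L)}{(dE_x/dx)|_L}\,e^{-\phi_b/E_m\lambda} \qquad (12)$$

$$\approx 0.5\,\frac{I_{ds}\,t_{ox}}{\lambda_r}
\left(\frac{\lambda E_m}{\phi_b}\right)^2
P(E_{ox}|_L)\,e^{-\phi_b/E_m\lambda} \qquad (12a)$$

where $E_m$ is the channel electric field at the drain end and we have used
$|dE_x/dx|_L \approx E_m/2t_{ox}$ in equation(12a) [20].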


2.4. Correlation to Substrate Current

Substrate current in the MOSFET is due to the impact ionization of the hot
electrons in the drain high field region. Since the hot electrons responsible for
CHEI and those responsible for the substrate current are heated by the same
field, there should be a correlation between these two processes. This
correlation has been studied and verified. The substrate current is known to be
related to $E_m$ by [14],

$$I_{sub} \approx \frac{A_i}{B_i}\,\frac{E_m^2}{(dE_x/dx)|_L}\,I_{ds}\,
e^{-B_i/E_m} \qquad (13)$$



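where $A_i$ and $B_i$ are the pre-exponential and exponential constants in the
ionization coefficient [21]. Eliminating $E_m$ from the exponential term in
equation (12) using equation (13),

$$\frac{I_{gate}}{I_{ds}} \approx K\,P(E_{ox}|_L)
\left(\frac{I_{sub}}{I_{ds}}\right)^{\phi_b/B_i\lambda} \qquad (14)$$

where

$$K = \frac{0.25}{\lambda_r}\left(\frac{\lambda E_m}{\phi_b}\right)^2
\frac{E_m}{(dE_x/dx)|_L}
\left[\frac{B_i\,(dE_x/dx)|_L}{A_i E_m^2}\right]^{\phi_b/B_i\lambda} \qquad (15)$$

$$\approx 0.5\,\frac{t_{ox}}{\lambda_r}\left(\frac{\lambda E_m}{\phi_b}\right)^2
\left[\frac{B_i}{2A_i t_{ox} E_m}\right]^{\phi_b/B_i\lambda}. \qquad (15a)$$

The approximation used in obtaining equation(12a) is used in obtaining
equation(15a). The constant $K$ has a value of approximately $1.5\times10^{-9}$
using the results presented in section(4) and assuming that
$E_m = 2\times10^{5}$ V cm$^{-1}$. The slope in the log-log plot of
$I_{gate}/I_{ds}$ versus $I_{sub}/I_{ds}$ will be equal to $\phi_b/B_i\lambda$.
Since $B_i$ and the hot-electron scattering mean-free-path $\lambda$ are
independent of $E_{ox}$, the slope provides a method to determine the barrier
height $\phi_b$ and its dependence on $E_{ox}$. This subject will be discussed
in section(4).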
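
Because the gate current varies as $e^{-\phi_b/E_m\lambda}$ while the
substrate current varies as $e^{-B_i/E_m}$, the log-log slope relating the two
normalized currents is the ratio of the exponents, $\phi_b/(B_i\lambda)$. The
sketch below shows how a measured slope could be turned into a barrier height
under that reading. The ionization constant $B_i = 1.7\times10^6$ V/cm and the
mean-free-path of 9 nm are assumed values, and the helper names are ours.

```python
def loglog_slope(phi_b, b_i, mfp_cm):
    """Expected slope phi_b / (B_i * lambda) of log(Ig/Id) vs log(Isub/Id)."""
    return phi_b / (b_i * mfp_cm)


def barrier_from_slope(slope, b_i, mfp_cm):
    """Invert the slope relation to estimate the barrier height in volts."""
    return slope * b_i * mfp_cm


B_I = 1.7e6     # assumed ionization exponential constant, V/cm
MFP_CM = 9e-7   # assumed hot-electron mean-free-path, cm

slope = loglog_slope(3.2, B_I, MFP_CM)
print(f"slope for a 3.2 V barrier = {slope:.2f}")
print(f"barrier for a slope of 1.9 = {barrier_from_slope(1.9, B_I, MFP_CM):.2f} V")
```

A change in the fitted slope with gate bias would then translate directly into
a change of the effective barrier with oxide field.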


3.  Experimental  Results  and  Discussion 

Experiments on CHEI are carried out on a series of n-channel polysilicon
gate MOSFETs where the processing parameters of the test devices are listed in
Table(1). The test transistors have a channel width of 100 μm. In the
experiments, the gate current ($I_{gate}$), substrate current ($I_{sub}$), and
source current ($I_s$) were measured simultaneously. The resolution of the
gate-current measurement was approximately 1 fA.


Figure(4) is a typical plot of CHEI gate current as a function of the
gate-to-source voltage ($V_{gs}$) at constant drain-to-source voltage
($V_{ds}$). The impact-ionization substrate current is also shown. The familiar
bell shaped curves are observed. Qualitatively, the dependence of the gate
current on $V_{gs}$ and $V_{ds}$ can be explained as follows. The channel
electric field in the MOSFET is proportional to the difference between $V_{ds}$
and the drain saturation voltage ($V_{dsat}$) [10]. At low $V_{gs}$, $V_{dsat}$
is small and hence the channel electric field at the drain is high.
However, because at low $V_{gs}$ the oxide electric field at the drain end
is in a direction that will inhibit the collection of the hot electrons by the
gate electrode, we expect no measurable gate current coming from the portion of
the channel with $V_{ch} > V_g$. When $V_{gs}$ increases towards $V_{ds}$, the
oxide field near the drain end becomes more favorable and we see a sharp
increase in gate current. When $V_{gs}$ increases further, $V_{dsat}$ will also
increase (although at a slower rate due to the velocity saturation effect [10])
and the peak channel electric field decreases, so the bell shaped curve of the
gate current results (channel-field limited regime).

In figure(5), the measured gate currents for the test devices are shown
against $V_{ds}$. The gate-to-source voltage is fixed at 10 volts. The
dependence of the gate current on the channel length is apparent. Reduction of
the channel length reduces $V_{dsat}$. Therefore, for the same drain-to-source
voltage, the channel electric field, and hence $I_{gate}$, is higher in shorter
channel devices. The devices with thinner oxide (figure(5b)) have higher gate
current because the channel electric field is higher.

Also observed is the flattening-off and, in some cases, a decrease of the gate
current when $V_{ds}$ approaches and eventually exceeds $V_{gs}$. This is the
same electrode limited behavior we described in the last paragraph. Figure(6a)
and (6c) illustrate the band diagrams for the channel-field limited and the
electrode-limited regimes. When $V_{ds}$ is smaller than $V_{gs}$, the oxide
field at the point of maximum channel electric field (i.e., the drain end) is in
a direction favorable to the collection of the injected electrons by the gate
electrode. When $V_{ds}$ is equal to $V_{gs}$, the oxide field is zero at the
drain end of the channel. At this or higher $V_{ds}$, it becomes increasingly
difficult for the hot electrons at the drain to reach the gate electrode, and
the gate current will be dominated by injection at locations closer to the
source where the channel field is weaker but the oxide field is
still favorable. Therefore, when $V_{ds}$ is greater than $V_{gs}$, we would
expect the observed gate current to flatten. Since there will be a finite
probability that some hot electrons will be trapped in the gate oxide at the
peak injection point, the oxide field at the peak injection point will decrease
rapidly and the peak injection point will move further toward the source. This
chain of events may eventually lead to a decrease in gate current when $V_{ds}$
exceeds $V_{gs}$. This indeed is observed in our experimental results. In
figure(7), we have shown the dependence of measured $I_{gate}$ versus time.
Although the biasing voltages are fixed, we observe that $I_{gate}$ decreases
with increasing time. This decrease can be attributed to the trapping of hot
electrons at the peak injection point just described.

4.  Analysis  of  Experimental  Results 

In the previous section, we have presented the measured gate current for a
number of devices. In order to compare our experimental results with the "lucky
electron" model, we shall focus on the normalized gate current, where this
quantity is defined as the ratio of the gate current to the source current (ie.
I_gate/I_s) of the MOSFET. The source current is essentially equal to the drain
current, differing only by I_sub. This is demonstrated in figure(8) where we have
replotted the data presented in figure(5a). The solid and the dash lines in
figure(8) are the calculated I_gate/I_s which will be discussed later. In view of
equation(10), the normalized gate current can be interpreted as the total
probability of CHEI. In figure(9), we have plotted the constant normalized gate
current contour (ie. the L_eff and V_ds at constant I_gate/I_s).

4.1. Potential Barrier and Correlation to I_sub

The first parameter we shall consider is the effective potential barrier
between the silicon conduction band edge and the silicon-dioxide conduction
band edge. This potential barrier has been found to be [17]

    \Phi_b = 3.2 - \beta E_{ox}^{1/2} - d\, E_{ox}^{2/3} \quad \mathrm{V}        (16)

The quantity 3.2 V is the Si-SiO2 interface barrier. The second term in
equation(16) represents the barrier lowering effect due to the image field [18].
The last term in equation(16) accounts phenomenologically for the finite
probability of tunneling between the silicon and the silicon dioxide [17]. For
SiO2, β = 2.59×10^-4 (V cm)^(1/2). The parameter d will be determined by
comparison with our experimental results.
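As an illustration only, the barrier expression of equation(16) can be evaluated numerically. The sketch below assumes E_ox is given in V/cm and is non-negative, uses the β quoted above, and takes for d the value this section goes on to select (4×10^-5 V^(1/3) cm^(2/3)). The function and constant names are hypothetical and are not part of the original work.

```python
# Minimal sketch of equation(16); names are illustrative assumptions.
BETA = 2.59e-4      # (V cm)^(1/2), image-force coefficient quoted in the text
D_TUNNEL = 4e-5     # V^(1/3) cm^(2/3), tunneling coefficient chosen in this section


def barrier_height(e_ox, beta=BETA, d=D_TUNNEL):
    """Effective Si-SiO2 barrier in volts for an oxide field e_ox in V/cm."""
    if e_ox < 0:
        # Equation(16) is only used here for a field that favors injection.
        raise ValueError("e_ox must be non-negative")
    return 3.2 - beta * e_ox ** 0.5 - d * e_ox ** (2.0 / 3.0)


# Example: barrier at zero field and at 1 MV/cm.
phi_zero = barrier_height(0.0)
phi_1mv = barrier_height(1.0e6)
```

At zero field the function returns the 3.2 V interface barrier; both correction terms lower the barrier as the field rises.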

Figure(10) is the log-log plot of the normalized gate current versus the
normalized substrate current using the gate-to-drain voltage V_gd as the
parameter. For a constant V_gd, the oxide electric field E_ox at the location of
maximum E_x (ie. the drain) is constant (parasitic drain diffusion resistance
may affect E_ox slightly). As discussed in section(2), the slope in this plot
gives the quantity Φ_b/(B_i λ). In figure(11), the experimentally determined
slopes are plotted against the calculated E_ox (effect of drain diffusion
resistance included) using the data in figure(10). A best fit to the dependence
of Φ_b on E_ox is obtained by selecting d so that the calculated B_i and λ
product will have the minimum variance over all E_ox considered. By adopting
this procedure, we choose d to be 4×10^-5 V^(1/3) cm^(2/3). This is in contrast
with the value 1×10^-5 V^(1/3) cm^(2/3) suggested by Ning et al. [17]. The
product B_i λ then has the value of 1.24 V and a variance of ~2.7% over all
E_ox. This value of B_i λ is then fixed in all subsequent calculations.
The theoretical curves shown in figure(10) are based on equation(14) where a
value of A_i = 1.6×10^5 cm^-1 is assumed. The dependence of Φ_b on E_ox is also
illustrated in figure(11).
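The selection of d described above can be sketched as a one-parameter search. This is a minimal sketch, assuming each measured slope equals Φ_b(E_ox)/(B_i λ), so that every data point implies a value of B_i λ for a trial d; the d that makes those implied values most nearly constant is kept. The relative spread (standard deviation over mean) is used here as a stand-in for the "minimum variance" criterion, and the data in the example are synthetic, generated only to exercise the code.

```python
from statistics import mean, pstdev


def barrier_height(e_ox, d, beta=2.59e-4):
    """Equation(16): barrier in volts for e_ox in V/cm."""
    return 3.2 - beta * e_ox ** 0.5 - d * e_ox ** (2.0 / 3.0)


def spread_of_product(d, fields, slopes):
    """Relative spread of the B_i*lambda values implied by each slope."""
    products = [barrier_height(e, d) / s for e, s in zip(fields, slopes)]
    return pstdev(products) / mean(products)


def select_d(fields, slopes, candidates):
    """Return the candidate d giving the most constant B_i*lambda."""
    return min(candidates, key=lambda d: spread_of_product(d, fields, slopes))


# Synthetic check (NOT measured data): slopes built from a known d and product.
fields = [1.0e6, 2.0e6, 3.0e6, 4.0e6]          # V/cm, hypothetical
true_d, true_product = 4e-5, 1.24
slopes = [barrier_height(e, true_d) / true_product for e in fields]
candidates = [i * 1e-6 for i in range(0, 101)]  # 0 to 1e-4 in steps of 1e-6
best_d = select_d(fields, slopes, candidates)
```

With real data the search grid and the spread measure would need to be chosen to suit the measurement noise.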

4.2. Comparison of I_gate Measurements and Model

With the silicon/silicon-dioxide potential barrier determined, we are ready
to make direct comparisons between the lucky electron model and experimental
results. In figure(8), we have shown the theoretical gate current curves
calculated by integrating equation(11) numerically (solid lines). The channel
electric field is determined from a simple quasi-2D MOSFET model formulated by
Ko [10]. The effective re-direction mean-free-path λ_r and the hot-electron
scattering mean-free-path λ are the only fitting parameters in the calculation
of the normalized gate current. The fit is insensitive to the re-direction
mean-free-path λ_r, which is chosen to be 61.6 nm based on the theoretical
considerations discussed elsewhere in this paper. With this, we now have λ, the
hot-electron scattering mean-free-path, as the only fitting parameter. A value
of 9.2 nm for λ gives the best fit for all channel lengths considered. We have
also included the calculated gate current curves using the approximate
analytical expression presented in equation(14) (dash lines). Good agreement is
obtained between the theoretical model and the experimental results.

In figures(12a) and (12b), we have used the correlation between the gate
current and the substrate current that was presented in section(2.4) to
calculate the normalized gate current from the measured I_sub. The solid lines
are calculations based on equations(14&15) while the dash lines are
calculations based on equations(14&15a). In all the calculations, the B_i and λ
product of 1.24 V is used. The values of λ and λ_r used here are the same as
those used in the calculation shown in figure(10). The agreement between
experimental results and calculations is very good.

In figure(13), we have shown the calculated normalized gate current curve
for the data shown in figure(4). A direct numerical integration approach is used
here. The bell-shaped dependence of the normalized gate current on V_gs is
obtained. There are some discrepancies between the calculated and the
experimental results in the high gate current regime. This is probably due to
the effects of electron trapping on the oxide field at the peak injection point
that were discussed in section(3).

4.3.  Effects  of  Temperature 

In figure(14), we have shown the effect of temperature on the CHEI gate
current. The CHEI gate current decreases with increasing lattice temperature.
The temperature coefficient (T.C. = d(ln I_gate)/dT) is experimentally found to
be -0.025 C^-1 (ie. a doubling of I_gate for every ~28°C decrease in
temperature). This agrees with a previous report [4] where a T.C. of
-0.0299 C^-1 was obtained. The temperature dependence of the substrate current
is also shown in figure(14). The temperature coefficient for the substrate
current is -0.0132 C^-1, which is smaller in magnitude than that for the gate
current. (Ref.[4] obtained a substrate current T.C. of -0.01036 C^-1.) The
decrease in the CHEI gate current is due to the reduction in the hot-electron
scattering mean-free-path (λ). The decrease in the impact-ionization substrate
current with temperature is due to the reduction in the optical-phonon mean
free path (λ_op) [25]. From equation(14), one expects the two slopes to differ
by a factor of Φ_b/(B_i λ) ~ 2.1, in good agreement with the temperature
coefficients given above. The eventual increase of I_gate and I_sub at high
temperatures is due to the increase of the intrinsic thermal generation of
carriers, which results in substrate hot electron injection. In view of the
analytical expression for the CHEI gate current presented in equation(12), we
see that the predominant temperature dependence of CHEI comes in through λ in
the exponential. As a first order approximation and for the same biasing
condition, we expect the maximum channel electric field E_m and I_s to be
independent of temperature. The ratio of the normalized gate current at two
temperatures is


In obtaining equation(17), we have assumed the effects of the temperature
dependence of λ_ox and λ_r to be small when compared to the effects due to λ.
This is a relatively good assumption because λ enters equation(12) in an
exponential. If the scattering of the hot-electrons is dominated by scattering
due to optical phonons, then we expect the temperature dependence of λ to be
similar to that of the optical-phonon scattering mean-free-path [25],

    \lambda_{op}(T) = \lambda_0 \tanh\left(\frac{E_p}{2kT}\right)        (18)

where λ_0 is the hot-electron scattering mean-free-path as T approaches 0 K and
E_p is the optical-phonon energy. In figure(15), we have shown the change in the
normalized gate current versus temperature obtained from the data presented
in figure(14). The temperature dependence of our experimental results
corresponds very well with that of the optical phonons. A value of 9.2 nm is
used for λ at room temperature (ie. 33°C) and good agreement is obtained when
E_p = 0.070 eV. From equation(17), λ_0 is calculated to be 10.6 nm. This is in
excellent agreement with the 10.5 nm determined by Ning et al. [17] but larger
than the 7.6 nm reported by Sze [25]. For comparison, we have also included the
experimental results obtained from Matsumoto et al. [4] in figure(15). The
excellent agreement between the simple approximation shown in equation(17)
and the experimental results indicates that the principal energy loss
mechanism is optical-phonon scattering. Again, the deviation between data and
theory at high temperature is believed to be due to substrate hot electron
injection [17].
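The numbers quoted in this section can be cross-checked with a short calculation. The sketch below assumes the tanh temperature dependence of equation(18), a Boltzmann constant of 8.617×10^-5 eV/K, and that a temperature coefficient defined as d(ln I)/dT implies a doubling interval of ln(2)/|T.C.|. The function names are hypothetical.

```python
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K


def mfp_factor(temp_k, e_p=0.070):
    """tanh(E_p / 2kT): the ratio lambda(T)/lambda_0 in equation(18)."""
    return math.tanh(e_p / (2.0 * K_B * temp_k))


def lambda_zero(lambda_t, temp_k, e_p=0.070):
    """Low-temperature mean free path from its value lambda_t at temp_k."""
    return lambda_t / mfp_factor(temp_k, e_p)


def doubling_interval(temp_coeff):
    """Temperature change (in degrees) over which the current doubles."""
    return math.log(2.0) / abs(temp_coeff)


# lambda = 9.2 nm at 33 C with E_p = 0.070 eV, as stated in the text.
lam0_nm = lambda_zero(9.2, 33.0 + 273.15)
# Doubling interval for the measured gate-current coefficient.
interval_c = doubling_interval(-0.025)
```

The first result reproduces the quoted λ_0 to within rounding; the second gives an interval slightly under 28°C.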


5.  Discussion 


In this section, we shall try to interpret the physical meanings of λ, the
channel hot-electron scattering mean-free-path, and λ_r, the momentum
re-direction mean-free-path. A way to estimate the channel impact-ionization
threshold energy (E_i in eV) will be presented.

As pointed out by many authors [12,14,16,22], the channel hot-electron
scattering mean-free-path (λ) depends on the optical-phonon scattering (m.f.p.
= λ_p) and the impact-ionization (m.f.p. = λ_i) mean-free-paths. When the
hot-electron kinetic energy (K.E.) is less than E_i, the predominant scattering
mechanism will be due to optical phonons. When the K.E. of the hot-electrons
becomes larger than E_i (ie. E_i ≤ K.E. ≤ Φ_b), both the optical-phonon
scattering and the impact-ionization scattering will be significant. In this
energy range, the effective scattering m.f.p. (λ') should be formulated as
1/λ' = 1/λ_p + 1/λ_i. Following Verwey et al. [16], the probability for a
channel hot-electron to acquire the energy Φ_b is

    \mathrm{Prob}(\Phi_b) \sim e^{-E_i/(E_x \lambda_p)} \; e^{-(\Phi_b - E_i)/(E_x \lambda')}        (19)

By letting E_i = γΦ_b, this probability can be written as

    \mathrm{Prob}(\Phi_b) \sim \exp\left[-\frac{\Phi_b}{E_x}\left(\frac{1}{\lambda_p} + \frac{1-\gamma}{\lambda_i}\right)\right]        (20)

By comparing the above equation with the formulation presented in section(2.1),
we can readily obtain

    \frac{1}{\lambda} = \frac{1}{\lambda_p} + \frac{1-\gamma}{\lambda_i}        (21)

Therefore, equation(21) implies that the channel hot-electron scattering m.f.p.
is a combination of λ_p and λ_i where the influence of impact-ionization is
weighted by (1-γ). This is different from the common belief that
λ^-1 = λ_p^-1 + λ_i^-1 [12,22]. In view of this and the large value of λ_i
(≥40 nm), it may be concluded that the dominant hot-electron inelastic
scattering in the channel of the MOSFET is due to optical phonons. This further
justifies the assumption we made in section(4.3) when formulating the
temperature dependence of λ. Using the result that B_i λ = 1.24 V and
λ = 9.2 nm, we have B_i = 1.34×10^6 V cm^-1. This is believed to be more
accurate than many values found in the literature, which range between
1.08×10^6 V cm^-1 and 1.75×10^6 V cm^-1 [22-25].
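The two relations used in the paragraph above can be sketched numerically. This sketch reads equation(21) as 1/λ = 1/λ_p + (1-γ)/λ_i, which is what the surrounding sentence describes (impact ionization weighted by 1-γ), and obtains B_i by dividing the fitted product B_i λ by λ. The mean free paths passed to `combined_mfp` in the example are hypothetical values chosen for illustration; only the 1.24 V and 9.2 nm figures come from the text.

```python
def combined_mfp(lambda_p, lambda_i, gamma):
    """Equation(21) as read here; inputs and result share one length unit."""
    return 1.0 / (1.0 / lambda_p + (1.0 - gamma) / lambda_i)


def conventional_mfp(lambda_p, lambda_i):
    """The usual reciprocal sum, 1/lambda = 1/lambda_p + 1/lambda_i."""
    return 1.0 / (1.0 / lambda_p + 1.0 / lambda_i)


def ionization_field(product_volts, lambda_nm):
    """B_i in V/cm from the product B_i*lambda (V) and lambda (nm)."""
    return product_volts / (lambda_nm * 1.0e-7)  # 1 nm = 1e-7 cm


# Values from the text: B_i*lambda = 1.24 V and lambda = 9.2 nm.
b_i = ionization_field(1.24, 9.2)

# Hypothetical mean free paths (nm) to compare the two combination rules.
weighted = combined_mfp(10.0, 40.0, 0.429)
unweighted = conventional_mfp(10.0, 40.0)
```

Because the impact-ionization term is reduced by (1-γ), the weighted rule always gives a λ closer to λ_p than the plain reciprocal sum does.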


The momentum re-direction mean-free-path has been interpreted as the
m.f.p. of scatterings where momentum relaxation takes place without significant
energy exchange. As pointed out by Long [27], the two most likely candidates
for elastic scattering in Si are intervalley acoustical-phonon scattering
(g-scattering) and long-wavelength acoustical-phonon scattering. In our
calculations, the value of λ_r is set to the combined m.f.p. for the
long-wavelength acoustic-phonon and the intervalley acoustic-phonon scattering,
which is equal to 61.6 nm as obtained by Duh et al. [23].


In our discussion on the correlation between the CHEI gate current and
the impact-ionization substrate current, the slope in the log(I_gate/I_s) versus
log(I_sub/I_s) plot yields the quantity Φ_b/(B_i λ). If we interpret the
e^{-B_i/E_x} in the impact-ionization coefficient as e^{-E_i/(E_x λ_p)}, then the
quantity may be re-written as

    \frac{\Phi_b}{B_i \lambda} = \frac{1}{\gamma}\,\frac{\lambda_p}{\lambda} \approx \frac{1}{\gamma}

In the last step, we have used equation(21) and assumed λ_i >> λ_p. (Values of
λ_i have been reported by Shockley [15], by Troutman [22] (40 nm) and by Verwey
[16] (187 nm).) Experimentally, the slope Φ_b/(B_i λ), and hence γ, may be
determined and we obtained γ = 0.429. This in turn implies that E_i equals
1.23 eV.
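A small consistency check on these figures is sketched below. It assumes the relations as read in this section, namely E_i = γΦ_b and a log-log slope of approximately 1/γ when λ_i is much larger than λ_p; the function names are hypothetical.

```python
def implied_barrier(e_i_ev, gamma):
    """Barrier height (eV) implied by E_i = gamma * Phi_b."""
    return e_i_ev / gamma


def approximate_slope(gamma):
    """Slope Phi_b/(B_i*lambda) in the limit lambda_i >> lambda_p."""
    return 1.0 / gamma


# Values quoted in the text: gamma = 0.429 and E_i = 1.23 eV.
phi_b_ev = implied_barrier(1.23, 0.429)
slope = approximate_slope(0.429)
```

The implied barrier falls below the 3.2 V zero-field value, as expected once the field-dependent lowering terms of equation(16) are included.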


6.  Reconciliation  between  the  Lucky  Electron  Model  and  the  Effective  Tempera¬ 
ture  Model 

The CHEI model presented in this study is based on the lucky electron
concept. There is, however, another approach which is based on the effective
hot-electron temperature [3,11,29]. In the effective temperature (T_e)
approach, the CHEI gate current can be formulated as

    \mathrm{Prob}(\Phi_b) \sim e^{-q\Phi_b/(k T_e)}        (22)

where k is the Boltzmann constant and T_e is the effective hot-electron
temperature. A recent experimental study suggested that on very short-channel
MOSFETs, the pure ballistic argument used in the lucky electron concept is
probably less accurate than the quasi-thermal-equilibrium effective
temperature approach [11]. Nevertheless, the lucky electron based hot-electron
model was found to be able to describe the hot-electron phenomena well. From
this, the empirical relationship T_e = (q/k) E_x λ = 1.07×10^-2 E_x is
proposed [11].
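The numerical coefficient in the empirical relationship can be reproduced as a quick check. The sketch assumes λ = 9.2 nm (the value fitted earlier), E_x in V/cm, T_e in kelvin, and q/k taken as the reciprocal of the Boltzmann constant expressed in eV/K. The names and the example field are hypothetical.

```python
Q_OVER_K = 1.0 / 8.617e-5   # kelvin per volt (q/k)
LAMBDA_CM = 9.2e-7          # 9.2 nm expressed in cm


def effective_temperature(e_x, lam_cm=LAMBDA_CM):
    """T_e = (q/k) * E_x * lambda, with E_x in V/cm and the result in K."""
    return Q_OVER_K * e_x * lam_cm


# Coefficient multiplying E_x, for comparison with the quoted 1.07e-2.
coefficient = Q_OVER_K * LAMBDA_CM
# Illustrative channel field of 3e5 V/cm (an assumed value).
t_e_example = effective_temperature(3.0e5)
```

Equating the exponent of equation(22) with Φ_b/(E_x λ) is what fixes this proportionality between T_e and the channel field.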

At this point, we would like to point out that the basic lucky electron
energy consideration is used mainly to derive the probability P_Φb in
section(2.1). The formulations of the other probabilities (ie. P_1 and P_2) are
quite independent of the lucky electron energy consideration and therefore are
still valid. The only modification to the model, if the effective electron
temperature concept is used, is to P_Φb, which will be


PU 


=  0.25x 


(23) 


7.  Conclusion 


We have presented a quantitative model for channel hot-electron injection
in MOSFETs. The model is based on the lucky electron concept, although the
result can be interpreted in terms of electron temperature as well. Three
probabilities are derived to describe the physical mechanisms responsible for
the CHEI gate current. They are (i) the probability of a hot-electron gaining
enough kinetic energy and normal momentum, (ii) the probability of not
suffering any inelastic collision during transport to the Si-SiO2 interface,
and (iii) the probability of suffering no collision in the oxide
image-potential well. An important result of this study is the reaffirmation,
by theory and experiments, of a correlation between the gate current and the
substrate current. The theoretical model agrees well with experimental results.

Based on experimental results and theoretical considerations, the
dependence of the Si-SiO2 effective barrier height on the oxide electric field
is evaluated. A coefficient that accounts phenomenologically for the finite
probability of tunneling is determined to be 4×10^-5 V^(1/3) cm^(2/3). This is
in contrast with the value of 1×10^-5 V^(1/3) cm^(2/3) obtained by Ning et
al. [17].

From the temperature dependence of CHEI, we concluded that optical-phonon
scattering is the dominant inelastic scattering suffered by the channel
hot-electrons. The effect of the impact-ionization mean-free-path on the
hot-electrons is found to be less significant. A value of 9.2 nm for the
hot-electron scattering mean-free-path at 33°C is obtained. Based on the
temperature dependence study, we found that an optical-phonon energy of 70 meV
best describes our results and also agrees with the data presented by
Matsumoto et al. [4]. From this, we determined that λ_0 equals 10.6 nm, which is
in excellent agreement with that obtained by Ning et al. [17].


The momentum re-direction scattering in the channel that is necessary for
CHEI can be characterized by the momentum re-direction mean-free-path λ_r.
Based on theoretical considerations, we attribute this to the combined m.f.p.
for intervalley acoustical-phonon scattering and long-wavelength
acoustical-phonon scattering (m.f.p. = 61.6 nm). However, the model presented
in this paper is insensitive to the exact value of λ_r.

From the detailed consideration of the impact-ionization coefficient and
the hot-electron scattering mean-free-path, we derived an impact-ionization
energy of 1.23 eV and B_i to be 1.34×10^6 V cm^-1.

Appendix (1) : Derivation of P_1 in strong inversion

At strong inversion, the potential along the y-direction (vertical) can be
expressed as


dV  v 


(A1.1) 


where φ is the potential, n_i is the intrinsic carrier concentration, L_D is the
intrinsic Debye length and β = q/kT. By integrating equation(A1.1) from the
surface (y=0) to y, we obtain


-gr»/2  ntV 


(A1.2) 


By inserting the result from equation(A1.2) into equation(4), we obtain
equation(6).

Appendix (2) : Derivation of P_1 for Short Channel Devices

In short channel devices, we have to take into account the two dimensional
effects in order to achieve reasonable results. This can be accomplished
approximately by separating the electron concentration (n in equation(4)) into
two components, namely n_ox and n_d. The component n_ox is the mobile charge
controlled by the gate while the component n_d is the mobile charge controlled
by the drain. Therefore equation(4) becomes,


    P_1 = \frac{\int_{y=0} n_{ox}(y)\, e^{-y/\lambda}\, dy + \int_{y=0} n_d(y)\, e^{-y/\lambda}\, dy}{\int_{y=0} n(y)\, dy}        (A2.1)

The denominator in equation(A2.1) can be expressed as,

    N_{mobile} = \int_{y=0} n(y)\, dy = \frac{I_s}{q W v_{sat}}        (A2.2)

where v_sat is the saturation velocity of the channel electrons and W is the
device width. The charge controlled by the gate can be approximated as

    N_{ox} = \int_{y=0} n_{ox}(y)\, dy = \frac{C_{ox}}{q}(V_{gs} - V_{ds})        (A2.3)

where C_ox is the gate capacitance per unit area and (V_gs - V_ds) is the voltage
across the oxide at the drain end. Using the result from the strong inversion
case (equation(6)), the first integral in the numerator of equation(A2.1) is

    \int_{y=0} n_{ox}(y)\, e^{-y/\lambda}\, dy = \left[1 - a\, e^{a} E_1(a)\right] \frac{C_{ox}}{q}(V_{gs} - V_{ds})        (A2.4)

If we approximate the charge component n_d to be constant from y=0 to a
critical depth Y_m, then the second integral in the numerator of
equation(A2.1) becomes,

    \int_{y=0} n_d(y)\, e^{-y/\lambda}\, dy \approx \int_{y=0}^{Y_m} n_d\, e^{-y/\lambda}\, dy = n_d \lambda \left(1 - e^{-Y_m/\lambda}\right)        (A2.5)

    Y_m = \frac{N_{mobile} - N_{ox}}{n_d - N_a}        (A2.6)

where n_d = (ε_si/q) |dE_x/dx| evaluated at E_m. Combining the results from
above, we have

    P_1 = \frac{\left[1 - a\, e^{a} E_1(a)\right] N_{ox} + n_d \lambda \left(1 - e^{-Y_m/\lambda}\right)}{N_{mobile}}        (A2.7)


In the case when V_gs ≤ V_ds, we shall assume that at the drain end, the drain
controls all the mobile charge, the substrate charge (the mobile charge is no
longer much greater than the substrate charge) and the charge at the gate that
is associated with the drain (ie. (n_d - N_a) Y_m = N_mobile + N_ox where
N_ox = (C_ox/q)(V_ds - V_gs)). Then the probability P_1 may be approximated as

    P_1 \approx \frac{\lambda}{Y_m}\left(1 - e^{-Y_m/\lambda}\right) \approx \frac{\lambda}{Y_m}        (A2.8)

The oxide field at the drain end for this case is in a direction that opposes
the motion of the injected electrons. The probability factor P_2 then becomes

    P_2 = e^{-t_{ox}/\lambda_{ox}}        (A2.9)

The silicon/silicon-dioxide potential barrier is increased to
Φ_b = 3.2 + (V_ds - V_gs) (in volts).
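The approximation made for this drain-controlled case can be illustrated with a short sketch. It assumes P_1 takes the form (λ/Y_m)(1 - e^(-Y_m/λ)), which tends to λ/Y_m when Y_m is much larger than λ, and that the barrier rises by the amount V_ds exceeds V_gs. The function names and the example depth are hypothetical.

```python
import math


def p1_drain_controlled(lam, y_m):
    """P_1 for uniform charge down to depth y_m (same length unit as lam)."""
    return (lam / y_m) * (1.0 - math.exp(-y_m / lam))


def p1_deep_limit(lam, y_m):
    """Limit of P_1 when y_m is much larger than lam."""
    return lam / y_m


def barrier_opposing_field(v_ds, v_gs):
    """Barrier in volts when the oxide field opposes injection (v_ds >= v_gs)."""
    if v_ds < v_gs:
        raise ValueError("this form applies only for v_ds >= v_gs")
    return 3.2 + (v_ds - v_gs)


# Illustrative numbers: lambda = 9.2 nm and an assumed depth of 92 nm.
exact = p1_drain_controlled(9.2, 92.0)
approx = p1_deep_limit(9.2, 92.0)
```

For a depth ten times λ the two expressions differ by less than one part in ten thousand, which is why the simpler form is adequate there.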


The expression for P_1 in equation(A2.7) requires the knowledge of the
channel-field gradient. A reasonable fit to P_1 when E_ox ≥ 0 [14] is

5.86x10“® 


Pi  * 


1.45x10®  ^  (1  + 


2x10 


- - jr-p - +  2.5xl0“z 


-3 


(A2.10) 


1.3 


L+IJ 


) 


and equals 2.5×10^-2 when E_ox < 0.

Figure Captions

(1)  A  cross-sectional  view  of  the  MOSFET.  The  three  scattering  probabilities  in 
the  model  are  illustrated. 

(2) A schematic illustration of the injected electron in the potential-distance
space. The lucky electron travels a distance d to gain the energy needed to
surmount the Si-SiO2 potential barrier.

(3) An illustration of the electron concentration versus the depth y at position
x in the channel.

(4) Measured source current (I_s), substrate current (I_sub), and gate current
(I_gate) versus the gate-to-source voltage (V_gs) for a 1.3 μm device from
wafer A.

(5) Measured I_gate versus the drain-to-source voltage (V_ds) for the devices
from wafer A (5a) and wafer B (5b). (V_gs = 10 V and V_sub = 0 V)

(6) The field-limited case (6a) and the electrode-limited case (6c) for CHEI are
illustrated. (6b) shows the condition where V_ds = V_gs. The point of maximum
CHEI in each case is illustrated.

(7) The dependence of I_gate on time illustrating the effect of charge trapping
on CHEI gate current.

(8) The normalized I_gate versus V_ds for the data shown in figure(5a). The solid
(direct numerical integration of equation(10)) and the dash lines (analytical
solution ie. equation(12)) represent theoretical results.

(9) Constant normalized I_gate contour for the data shown in figures(5a&5b).

(10) Experimental results on the correlation between the normalized I_gate and
the normalized I_sub for constant V_gd. The solid lines represent theoretical
results based on equations(14&15).

(11) Dependence of the Si-SiO2 potential barrier and Φ_b/(B_i λ) on E_ox at the
drain end. The data are derived from the experimental results shown in
figure(10).

(12) Comparison between experimental results and the calculated normalized
I_gate using the correlation between the gate current and the substrate
current.

(13) Comparison between calculation and the data shown in figure(4). Numerical
integration based on equation(10) is used to obtain the results on I_gate.
Equation(13) is used to obtain the results on I_sub.

(14) The temperature dependence of the normalized I_gate and the normalized
I_sub.

(15) Data shown in figure(14) are used to obtain the dependence of the ratio of
the normalized I_gate on temperature. Data obtained from Matsumoto et al. [4]
are also plotted.

Abstract

The degradation of thin gate oxide (~100 Å) n- and p-channel MOSFETs subjected
to substrate hot-carrier injection is discussed. The generation of oxide
trapped charges is observed to be sub-linearly dependent on the applied oxide
field, while the generation of interface trapped charges shows a linear
dependence on the applied oxide field. The generation rates are found to be a
function of carrier fluence and the oxide field, and are independent of the
injection current density. The generation of interface traps correlates well
with the mobility and subthreshold current degradation. An oxide field around
5 MV/cm is found to be a critical value for accelerating device degradation.
There is no significant interface trap generation under substrate hot hole
injection for hole fluences up to 2×10^17/cm^2. The threshold voltage shifts
decrease with increasing applied substrate bias. Possible mechanisms are
discussed to account for the experimental data.


1.  Introduction 

Scaled devices for NMOS and CMOS integrated circuits have become increasingly
important. As the channel and oxide fields increase, hot-carrier emission
imposes serious limitations on the long-term reliability of scaled VLSI
circuits. Studies of device degradation to date have been mostly based on
channel hot-carrier injection, which is very localized in nature.
Consequently, the results have been difficult to interpret and generalize. It
is almost impossible, for example, to study the degradation mechanisms with
this injection technique. Substrate hot carriers can be uniformly injected
throughout the channel. However, this technique has not been used to study the
degradation of transconductance, channel mobility, subthreshold current, etc.
This paper reports the device degradation due to substrate hot-carrier
injection into thin gate oxide (~100 Å) p- and n-channel MOSFETs.




2.  Device  Preparation 

The n-channel and p-channel transistors used in our investigation are
fabricated with the standard silicon gate process. The starting materials are
all <100> substrates with substrate resistivities in the range of
10-25 ohm-cm. The gate oxide (~100 Å) is grown in a dry oxygen ambient for
~18 min., followed by a 3-min. nitrogen anneal at the same temperature. The
polysilicon gate is arsenic doped to avoid dopant diffusion through the thin
oxide. The channel region is ion implanted with a final surface concentration
of ~10^17/cm^3. The drain and source are implanted with arsenic for the
n-channel devices to give a final junction depth of ~0.7 μm and with boron for
the p-channel devices to give a final junction depth of ~1.2 μm. Special
process steps are also taken to suppress any drain (source) to gate current
leakage resulting from drain (source) ion implant damage to the thin oxide.
The devices are all sintered in forming gas at 400°C for 15 min. after
metallization.

The threshold voltages are 0.5 V and -1.4 V for n- and p-channel devices
respectively. The subthreshold current slope is ~80 mV/decade. Channel
mobilities are 450 cm^2/V-sec for electrons and 170 cm^2/V-sec for holes. The
drain (source) junction breakdown voltage is ~18 V for the n+p junction and
~32 V for the p+n junction with zero gate bias.


3.  Experimental  Details 

The experimental configuration used for substrate hot-carrier injection is
shown in Fig. 1. Gate bias (V_g) and substrate bias (V_sub) are separately
controlled with source and drain grounded. The gate bias is always large
enough to invert the surface so that the channel is near the ground potential.
The minority carriers, electrons in n-channel devices and holes in p-channel
devices, are injected from the adjacent forward-biased p-n junction, which is
located 100 μm away. Most of the minority carriers that are injected will
recombine in the substrate, giving rise to the substrate current. However, a
small fraction of these injected carriers will enter the deep depletion region
and will be accelerated toward the surface inversion layer. Those arriving at
the Si-SiO2 interface with sufficient energy to surmount the barrier will be
injected into the SiO2. The fluence (number of carriers/unit area) of injected
carriers is monitored via the gate current measurement. High frequency (H-F)
(1 MHz) C-V, quasi-static C-V, and Fowler-Nordheim (F-N) I-V characteristics
are measured before and after carrier injection to determine the flat band
voltage, the interface trap density and the bulk oxide charge density
respectively. The drain current is measured with the gate voltage varying from
1 V below threshold up to 5 V at a constant drain voltage of 50 mV. The
transistor threshold voltage and transconductance are extracted from the
measured drain current data using the current-to-voltage relationship in the
linear region.
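One common way to carry out such an extraction is a straight-line fit to the linear-region data, sketched below. The paper does not state the exact relation used, so this sketch assumes the simple long-channel form I_d = k (V_g - V_T - V_d/2) V_d with no mobility degradation, and it expects only points well above threshold. The function names and the synthetic data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extract_vt_gm(v_gate, i_drain, v_drain=0.05):
    """Return (threshold voltage, transconductance) from linear-region data."""
    slope, intercept = fit_line(v_gate, i_drain)
    v_intercept = -intercept / slope          # gate voltage where the fit hits zero
    return v_intercept - v_drain / 2.0, slope


# Synthetic data (NOT measurements): k = 1e-4 A/V^2, V_T = 0.5 V, V_d = 50 mV.
k_factor, vt_true, v_d = 1.0e-4, 0.5, 0.05
v_gate = [1.0, 1.5, 2.0, 2.5, 3.0]
i_drain = [k_factor * (vg - vt_true - v_d / 2.0) * v_d for vg in v_gate]
vt_fit, gm_fit = extract_vt_gm(v_gate, i_drain, v_d)
```

Comparing the fitted values before and after stress would give the threshold shift and the transconductance change for a device.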

4. Results and Discussions

4.1. Substrate Hot-Electron Effect

The effect of hot carrier fluence and applied oxide field on the charge
trapping, and therefore on ΔV_T, can be obtained by operating a device at the
conditions A, B, C and D shown in Fig. 2. A, B and C have approximately the
same gate current, but their oxide fields are 4, 8 and 10 MV/cm respectively.
D has the same oxide field (8 MV/cm) as B except that the gate current is 5
orders of magnitude lower. For the same stressing time (1 hr), there is almost
no observable threshold voltage shift (ΔV_T) in the device operated under
condition D. However, ΔV_T gradually increases from condition A to condition C.
The above observation indicates that the charge trapping effect is a function
of both carrier fluence and oxide field.
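Since the fluence is inferred from the gate current, the comparison between conditions B and D can be made concrete with a short sketch. It assumes a constant gate current over the stress time, so that fluence = I_g t / (q A). The gate area and the current values used below are hypothetical, chosen only to show the five-orders-of-magnitude ratio.

```python
ELEMENTARY_CHARGE = 1.602e-19  # coulombs


def fluence(i_gate_amps, time_s, area_cm2):
    """Injected carriers per cm^2 for a constant gate current."""
    return i_gate_amps * time_s / (ELEMENTARY_CHARGE * area_cm2)


# One hour of stress on an assumed 1e-4 cm^2 gate area.
stress_time_s = 3600.0
gate_area_cm2 = 1.0e-4                               # hypothetical device area
fluence_b = fluence(1.0e-9, stress_time_s, gate_area_cm2)    # assumed 1 nA
fluence_d = fluence(1.0e-14, stress_time_s, gate_area_cm2)   # 5 orders lower
ratio = fluence_b / fluence_d
```

For equal stress time and oxide field, the fluence scales directly with gate current, which is consistent with condition D showing almost no shift.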

Threshold voltage shift versus electron fluence for various gate bias
conditions is shown in Fig. 3. ΔV_T increases monotonically with
increasing fluence and oxide field (E_ox > 5 MV/cm). ΔV_T is
approximately the same and tends to saturate for oxide fields less
than 5 MV/cm. However, a linear increase in the trapped charge density
due to trap filling and a constant trap generation rate has been
observed under high oxide field and large fluence [4]. The slopes in
the high-fluence region, which are proportional to the trap generation
rate, are shown in the inset of Fig. 3. The gate current under the
same substrate bias and oxide field can be controlled by varying the
injecting current. It has been found that ΔV_T is a very weak function
of I_g over the range 0.1 nA to 1 nA, i.e., ΔV_T depends solely on the
electron fluence through the oxide at this bias. The initial negative
value of ΔV_T for high gate biases, corresponding to positive charge
trapping, has been tentatively identified as hole trapping near the
gate interface. The trapped holes may come from electron-hole pair
generation in the polysilicon gate due to high-energy electrons
injected from the substrate, or from impact ionization in SiO2. The
trapped holes can be de-trapped or neutralized by reverse gate bias
and retrapped by hot hole filling.

The threshold voltage shift is caused by the oxide trapped charge and
the interface trapped charge. The effects of these two factors can be
separated by comparing the positive-gate Fowler-Nordheim voltage shift
(ΔV_FN) with ΔV_T. That is, to first order, ΔV_FN is insensitive to
the interface trapped charge. The difference between ΔV_T and ΔV_FN,
therefore, will be the contribution from the interface trapped charge
(ΔV_it) [5]. Fig. 4 is a plot of the voltage changes at a constant
carrier fluence versus the applied oxide voltage. ΔV_FN, corresponding
to the oxide trapped charge, decreases at high gate biases because of
the significant hole trapping effect. However, the generation of
interface trapped charge shows a linear dependence on the applied
oxide field. An oxide field around 5 MV/cm is found to be the critical
value for accelerating device degradation. The generated
interface-trap distribution, deduced from the quasi-static C-V
measurement, is similar to that of the interface traps generated by
F-N stressing [5], ionizing radiation [6], and internal photoemission
[7]. Namely, a characteristic peak centered around 0.65 eV above the
valence band edge is observed. The interface trap density gradually
decreases for energies below the peak. This peak state is largely
responsible for the mobility and subthreshold degradation. The
interface-trap generation is much more significant in the case of
substrate hot-electron injection than in F-N injection, where the
carrier is essentially in thermal equilibrium at the interface. The
generation of interface traps and fixed charges close to the Si-SiO2
interface (ΔV_it in Fig. 4) correlates well with the mobility and
subthreshold current degradation, which is shown in Fig. 5.
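
The first-order separation of the threshold shift into oxide and
interface contributions amounts to a subtraction, optionally followed
by a conversion to an equivalent sheet charge density. The sketch
below is illustrative only: the oxide thickness and the voltage shifts
are hypothetical, and the density conversion assumes the charge sits
at the Si-SiO2 interface.

```python
Q_E = 1.602176634e-19             # elementary charge, C
EPS_OX = 3.9 * 8.8541878128e-14   # SiO2 permittivity, F/cm


def split_threshold_shift(dvt, dv_fn):
    """Return (oxide part, interface part) of the threshold shift in V.

    To first order the positive-gate F-N shift reflects only oxide
    trapped charge, so the interface part is the remainder.
    """
    return dv_fn, dvt - dv_fn


def effective_density(dv, t_ox_cm):
    """Sheet density (cm^-2) of interface-located charge giving |dv|."""
    c_ox = EPS_OX / t_ox_cm       # oxide capacitance per area, F/cm^2
    return c_ox * abs(dv) / Q_E


# Hypothetical example: 100 Angstrom oxide, shifts in volts.
dv_ox, dv_it = split_threshold_shift(dvt=0.30, dv_fn=0.10)
n_it = effective_density(dv_it, t_ox_cm=1.0e-6)
```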

4.2.  Substrate  Hot-Hole  Effect 

The p-channel transistor is subjected to stressing similar to the
n-channel case. However, in this case the hot carriers are holes, and
higher substrate bias and injection current are necessary due to the
larger barrier height. The substrate hot-hole current from the
substrate depletion region and the F-N electron tunneling current from
the gate are shown in Fig. 6. The steeper slope of the substrate
hot-hole current versus gate bias, compared with the substrate
hot-electron current (shown in Fig. 2), is probably due to the large
number of scattering events within the barrier-lowering region because
of the extremely low mobility of holes in SiO2. The stressing
conditions are chosen with the gate bias less than 8 V to avoid
significant electron tunneling. Hole trapping can be detected from the
negative transistor ΔV_T as well as from negative F-N I-V shifts. It
has been found that the subthreshold slope suffers no observable
degradation up to 2×10^17/cm^2 of injected hot holes. Because the
measured transistor threshold voltage shift is the same as the H-F C-V
flat-band voltage shift, one can conclude that there is no significant
interface trap generation above midgap. This conclusion is then
double-checked by the quasi-static C-V measurement, which showed very
low interface-trap density after hot-hole injection. This indicates no
significant interface-trap generation in the band gap.


4.3.  Substrate  Bias  Effect 

The threshold voltage shift is found to be a strong function of
substrate bias, as shown in Fig. 7. Namely, less degradation is
observed at higher substrate bias, and it tends to saturate at lower
substrate bias. The change in subthreshold current slope (ΔS),
however, is shown to be a weak function of V_sub. Both ΔV_T and ΔS are
strong functions of applied gate bias in the substrate hot-electron
injection case.


Three possible mechanisms have been considered to account for the
experimentally observed substrate bias dependence: (1) impact
ionization in the silicon depletion region, followed by subsequent hot
carrier injection over the SiO2-Si barrier, (2) impact ionization in
SiO2 and subsequent trapping of carriers, and (3) carrier energy
dependence of the trap capture cross section. While in general all
three processes may take place in the above experiments, some
processes may dominate over the others, depending on experimental
conditions.


For the p-channel device shown in Fig. 7, impact ionization caused by
injected holes can occur in the silicon depletion region due to the
large substrate bias. Some of the impact-ionized hot electrons will
have sufficient energy to overcome the oxide barrier and get injected
into SiO2, despite the negative gate voltage [8]. A portion of the
injected electrons can be captured by the trapped holes with
relatively high probability, due to the large capture cross sections
of the coulombic traps. This results in a charge compensation effect,
which leads to a reduction of ΔV_T. Since the degree of impact
ionization increases with the substrate bias, the end result is the
observed decrease in ΔV_T with increasing substrate bias.

For the n-channel device, although impact ionization in the silicon
depletion region can also occur, its degree is substantially lower due
to the much lower substrate bias. In addition, hole injection into
SiO2 in this case is much less likely [8], probably due to the higher
barrier for hole injection. However, for the n-channel device at high
oxide field (V_g = 8 V, corresponding to ~8 MV/cm), impact ionization
in the SiO2 layer may take place due to hot electrons emitted from the
substrate, which leads to net hole trapping [10], causing a reduction
in ΔV_T. This effect is more pronounced for higher substrate biases,
since the average energy of the hot electrons entering SiO2 is higher,
which is consistent with the observed substrate bias effect.

For n-channel devices with low oxide field (V_g = 3 V, corresponding
to ~3 MV/cm), impact ionization in SiO2 is much less likely. In this
case, the carrier energy dependence of the oxide trap capture cross
section may play a dominant role. As the energy of the injected
electron increases, the capture probability decreases [11], which can
explain the substrate bias dependence observed here. An alternative
interpretation is that, as the energy of the injected electrons
increases, the centroid of the trapped electrons will move away from
the SiO2/Si interface, which will be reflected in the external
measurement as a smaller ΔV_T.
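
The centroid interpretation can be made concrete with the standard
sheet-charge relation, in which the threshold shift is proportional to
the distance of the charge from the gate. The trapped density and
oxide thickness below are hypothetical values chosen only to show the
trend.

```python
Q_E = 1.602176634e-19             # elementary charge, C
EPS_OX = 3.9 * 8.8541878128e-14   # SiO2 permittivity, F/cm


def vt_shift_from_trapped_electrons(n_cm2, t_ox_cm, dist_from_si_cm):
    """Threshold shift (V) from a sheet of trapped electrons.

    The sheet lies dist_from_si_cm away from the Si-SiO2 interface.
    The shift scales with the distance from the gate, so it is largest
    for charge at the Si-SiO2 interface and zero for charge at the gate.
    """
    if not 0.0 <= dist_from_si_cm <= t_ox_cm:
        raise ValueError("charge sheet must lie inside the oxide")
    x_from_gate = t_ox_cm - dist_from_si_cm
    return Q_E * n_cm2 * x_from_gate / EPS_OX


T_OX = 1.0e-6      # hypothetical 100 Angstrom oxide, in cm
N_TRAP = 1.0e12    # hypothetical trapped electron density, cm^-2

shift_at_interface = vt_shift_from_trapped_electrons(N_TRAP, T_OX, 0.0)
shift_mid_oxide = vt_shift_from_trapped_electrons(N_TRAP, T_OX, T_OX / 2)
```

With the same trapped density, moving the centroid from the interface
to the middle of the oxide halves the externally measured shift.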

5.  Other  Discussions

(1) Threshold voltage shift as a function of substrate bias has been
studied by using a nonavalanche injection technique [1]. At first
glance, it appears that the results shown here contradict those
reported in Ref. [1], which showed that ΔV_T increases with increasing
V_sub. The discrepancy arises because a fixed stressing time was used
in [1], while a fixed carrier fluence is used in this study. We
believe carrier fluence is a more fundamental parameter, which offers
more useful comparisons for studies of device degradation.

(2) The interface trap generation caused by substrate hot-hole
injection in p-channel transistors is less than that caused by
substrate hot-electron injection in n-channel transistors for the same
carrier fluence. This is different from the case of avalanche hole
injection [12].

(3) For large V_sub, it appears that the ΔV_T degradation caused by
substrate hot-hole injection is less than the degradation caused by
substrate hot-electron injection. For low V_sub, however, substrate
hot-hole injection causes much more significant ΔV_T degradation. This
is shown in Fig. 8.


(4) Due to the strong correlation of ΔV_T with V_sub and V_g, as shown
in Fig. 7, it is possible to choose an optimal set of operating
parameters for V_sub and V_g to achieve the lowest ΔV_T and interface
trap generation.
1.  Introduction

Floating-gate structures which utilize charge tunneling through thin
oxides are now widely used for making electrically erasable
programmable read-only memories, or EEPROMs [1]. The future trend for
this type of non-volatile memory design is likely to move toward high
packing density with thinner oxides (<100 Å) as the tunnel barrier. It
is important, therefore, to recognize all relevant effects associated
with charge tunneling through thin oxide in the MOS system. In
particular, we need to understand the tunneling characteristics of
thin oxides so that these devices are scaled properly without
suffering a performance penalty. In this paper, theoretical modeling
and experimental results on direct tunneling, tunneling-induced
electron-hole pair generation in silicon, and hole tunneling are
presented. These studies not only lead us to a better understanding of
the fundamental limits in these devices but also provide us with some
insights into the physical properties of the Si-SiO2 materials.

2.  Direct  vs.  Fowler-Nordheim  Tunneling 

Charge retention is a very important consideration in the performance
of non-volatile memory. Typically, a difference of at least 11 orders
of magnitude in current levels is required between programming and
read operations. This can readily be achieved in the thicker oxides
owing to the exponential current-voltage relationship of
Fowler-Nordheim tunneling (Fig. 1a). In thinner oxides (<60 Å),
however, direct tunneling (Fig. 1b) can become important at low
voltages [2]. The trapezoid-shaped oxide barrier gives rise to a
tunneling current which has a relatively weak dependence on oxide
field. Using the semi-classical independent electron approach with a
Franz-type two-band-like dispersion relation in the WKB approximation
of tunneling probability, we have modeled the tunneling
current-voltage relationships for a wide range of oxide thicknesses.
Some of the theoretical results, together with the corresponding
experimental data, are illustrated in Fig. 2 for electron tunneling
from a metal gate to a p-type substrate. From Fig. 2, it is apparent
that the current changes much more slowly as a function of gate
voltage for oxide voltages below the barrier-height potential (3.2 V).
This direct tunneling phenomenon has to be considered in scaling the
oxides so that no excessive charge can leak away during the read
cycles.
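
The contrast between the two regimes can be illustrated with a
simplified WKB estimate of the barrier transmission. The sketch below
is not the model used in this work: it substitutes a single-band
parabolic dispersion for the Franz-type two-band relation, ignores the
supply function and image-force lowering, and assumes an oxide
effective mass of 0.5 m0. The function name is hypothetical.

```python
import math

HBAR = 1.054571817e-34    # reduced Planck constant, J s
Q_E = 1.602176634e-19     # elementary charge, C
M0 = 9.1093837015e-31     # electron rest mass, kg


def wkb_transmission(v_ox, t_ox_m, phi_b_ev=3.2, m_ratio=0.5):
    """WKB transmission through the oxide barrier at oxide voltage v_ox.

    The barrier is trapezoidal (direct tunneling) for v_ox < phi_b and
    triangular (Fowler-Nordheim) for v_ox >= phi_b. Parabolic bands and
    the effective-mass ratio m_ratio are simplifying assumptions.
    """
    if v_ox <= 0.0:
        raise ValueError("v_ox must be positive")
    field = v_ox / t_ox_m                       # oxide field, V/m
    phi = phi_b_ev * Q_E                        # barrier height, J
    # Barrier remaining at the far interface; zero in the F-N regime.
    remaining = max(phi_b_ev - v_ox, 0.0) * Q_E
    prefactor = 4.0 * math.sqrt(2.0 * m_ratio * M0) / (3.0 * HBAR * Q_E * field)
    exponent = prefactor * (phi ** 1.5 - remaining ** 1.5)
    return math.exp(-exponent)


# Hypothetical 50 Angstrom oxide.
T_OX = 5.0e-9
t_direct = wkb_transmission(1.0, T_OX)
t_fn = wkb_transmission(4.0, T_OX)
```

Under these assumptions the logarithm of the transmission changes more
slowly with voltage below the 3.2 V barrier height than just above it,
which is the qualitative behavior described for Fig. 2.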

3.  Electron-Hole  Pair  Generation  in  Si

Although those electrons participating in tunneling at the cathode are
"cold", they do become "hot" when arriving at the anode, with a
maximum possible energy equal to the potential drop between the two
electrodes. Upon entering the Si, these energetic electrons will lose
energy by phonon scattering and by impact ionization [3,4]. In the
erase operation of a memory cell, therefore, electron-hole pairs are
generated in the n+ region underneath the tunnel oxide. Some of these
generated holes will drift to the Si-SiO2 interface and flow out into
the p+ channel stop region to become a substrate current. This current
can be comparable to the tunneling electron current in magnitude and
is generally not desirable in circuit operation.

In this work, Si-gate p-channel MOS transistors are used for studying
the carrier multiplication effect (Fig. 3). Our experimental set-up is
the same as that of previous investigators [5,6], in which a negative
voltage ramp is applied to the gate electrode and the gate,
drain/source and substrate currents are measured. Under most
conditions, the gate current (I_g) is purely the electron tunneling
current and the drain/source current (I_d-s) is the generated hole
current (flowing out of the p+ drain/source), whereas the substrate
current (I_sub) is the combination of the tunneling electron current
and the generated electron current. Therefore, the ratio I_d-s/I_g
represents the quantum yield (γ), or the number of generated
electron-hole pairs per incident electron for a given bias. The
quantum yields vs. oxide voltage, which is ~1.3 V less than the gate
voltage, are plotted in Fig. 4 for several different oxides. It is
interesting to observe that there exists a threshold (~1.7 V) for pair
generation and that the maximum quantum yield is less than 2,
irrespective of the oxide thickness and applied voltage. The threshold
effect at ~1.7 V is a direct manifestation of conservation of momentum
in silicon [4], which requires a minimum of 3/2 of the bandgap energy
for the incident electron to cause impact ionization. The experimental
observation that the quantum yields are relatively insensitive to the
applied voltages for thicker oxide (>100 Å) samples is indicative of
strong lattice scattering in the insulator. It is believed that hot
electrons lose energy and get randomized in SiO2 primarily by LO
phonon emission [5,6]. Assuming the energy distribution of these hot
electrons is a displaced Maxwellian, one can derive an energy
conservation equation [7,8] for the transport of the "average"
electron in SiO2. This phenomenological equation is expressed in
Eq. 1.

\frac{dE_e}{dx} = qE_{ox} - \frac{E_e}{\lambda}          (1)


where E_e is the average electron energy, E_ox is the oxide field and
λ is the empirical energy-relaxation mean free path. The term on the
left-hand side of Eq. 1 represents the net energy increase per unit
distance. The first term on the right-hand side represents the energy
gain from the electric field and the second term represents the energy
loss due to collisions. The solution to the above first-order linear
differential equation is given by Eq. 2.


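E_e(x) = qE_{ox}\lambda \left[ 1 - \exp\left( -\frac{x - s_{ox}}{\lambda} \right) \right], \quad s_{ox} = \frac{\Phi_b}{qE_{ox}}, \quad s_{ox} < x < t_{ox}          (2)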


where t_ox is the oxide thickness, s_ox is the tunneling distance in
the oxide and Φ_b is the Si-SiO2 barrier height (3.2 eV). The average
energy of these electrons on exiting the oxide is simply
E_e(x = t_ox). Note that Eq. 2 is only good for qV_ox > Φ_b, where
scattering of electrons in the conduction band of SiO2 has occurred.
In direct tunneling, on the other hand, no scattering of the tunneling
electrons is assumed and the energy gained on exiting the oxide is
equal to the oxide potential drop.

Eq. 2 says that E_e(t_ox) is only a function of the oxide field,
provided t_ox - s_ox (or, the distance that the electrons spend in the
conduction band of SiO2) is greater than λ. The experimental data on
quantum yield are re-plotted in Fig. 5 as a function of oxide field.
It can be seen that, in thicker oxide samples, the quantum yields,
which are directly related to the incident electron energies, are
essentially not thickness dependent, in agreement with our theoretical
analysis.
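
The saturation of the average exit energy with thickness can be
checked numerically from the solution of the energy balance,
E_e = q E_ox λ [1 - exp(-(t_ox - s_ox)/λ)] with s_ox = Φ_b/(q E_ox).
The sketch below is illustrative: the function name, the units
(Angstroms and MV/cm) and the sample thicknesses are choices made
here, and the default mean free path is the 30 Å value quoted in the
text.

```python
import math


def avg_exit_energy_ev(e_ox_mv_cm, t_ox_angstrom,
                       lam_angstrom=30.0, phi_b_ev=3.2):
    """Average electron energy (eV) at x = t_ox from the energy balance.

    Valid only when the oxide voltage exceeds the barrier height, i.e.
    when electrons travel in the SiO2 conduction band.
    """
    field = e_ox_mv_cm * 0.01          # 1 MV/cm = 0.01 V/Angstrom
    v_ox = field * t_ox_angstrom
    if v_ox <= phi_b_ev:
        raise ValueError("requires q*V_ox > Phi_b (F-N regime)")
    s_ox = phi_b_ev / field            # tunneling distance, Angstrom
    path = t_ox_angstrom - s_ox        # distance in the conduction band
    return field * lam_angstrom * (1.0 - math.exp(-path / lam_angstrom))


# Same 10 MV/cm field, three hypothetical thicknesses.
energies = [avg_exit_energy_ev(10.0, t) for t in (100.0, 150.0, 300.0)]
```

Once the path in the conduction band is several mean free paths long,
the result approaches q E_ox λ and stops depending on thickness, which
is the behavior attributed to the thicker oxide samples.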

To further extend this work, we have used Eq. 2, together with a λ
value of 30 Å, to calculate the average energy of electrons incident
at the Si surface. Quantum yields are re-plotted in Fig. 6 as a
function of the calculated average electron energy (relative to the
conduction-band edge of silicon). The result compares favorably with
the theoretical work on ionization probability reported by Drummond
and Moll [4] in the energy range below ~3.5 eV. It is interesting to
observe an "inflection" in curvature around 4 eV. This phenomenon is
probably related to the detailed energy-band structure in the silicon.
The higher yield exhibited by the 79 Å oxide sample, whose thickness
is comparable to λ, could be a manifestation of the ballistic effect
[8] in SiO2, although this needs to be further investigated.

4.  Hole  Tunneling 

Faraone and Hsueh [9], by using a structure incorporating a p-channel
MOSFET with a metal/tunnel-oxide/n-silicon device, have demonstrated
that the dominant carrier transport in 20 Å or thinner SiO2 (Al-gate)
is hole flow under negative bias. In our work on quantum yield, we
also observed hole transport through very thin oxides (<40 Å) in the
voltage range where carrier multiplication is not significant. In
Fig. 7, we show the gate and drain-source currents for p-channel
MOSFETs of oxide thicknesses 35 Å and 41 Å. The p-n diode leakage
current of these transistors is ~10 femtoamps. For the 35 Å oxide
sample, the drain-source current is initially negative (flowing into
the drain/source), indicating hole carriers either transporting
through the oxide or recombining with tunneling electrons at the Si
surface. For gate voltages more negative than -3 V (oxide voltages
more negative than -1.7 V), appreciable holes are generated in the Si
channel depletion region by energetic tunneling electrons, causing a
decrease in the drain-source current. I_d-s changes sign at
V_g ~ -3.2 V and stays positive beyond this voltage. Notice that the
gate current is about one order of magnitude larger than the
drain-source current before significant pair generation takes place.
This is a good indication that electrons are still the dominant
carrier species in transport through 35 Å-thick SiO2 (poly-gate). The
41 Å oxide data are qualitatively the same, except that the hole
tunneling current is much smaller.

For Si-gate n-channel MOSFETs, several authors [10,11,12] have
observed substrate hole current in thin and thick oxide samples under
sufficiently high positive gate bias. They have attributed this
current to tunneling by certain kinds of valence-band electrons that
leave holes behind, giving rise to a substrate current. We have
measured the substrate currents as well as the gate currents for
devices with different oxide thicknesses and plot their ratios
(I_g/I_sub) in Fig. 8. In the very thin oxide data, a threshold at
V_ox ~ 1.1 V for the onset of hole current can be observed. This may
correspond to the situation where the valence-band edge at the Si
surface is in line with the conduction-band edge of the n+ poly-gate.
Hence, as the gate voltage is increased further, there will be empty
states available on the gate side for the Si valence-band electrons to
tunnel into. The ratio (I_g/I_sub) increases from <100 at low bias to
>1000 at medium bias and then decreases at high bias. However, our
theoretical simulation based on a 2-band model predicts an
(I_g/I_sub) ratio generally larger than 10,000 for medium and high
biases if only valence-band electron tunneling is considered. This
discrepancy points to the possibility that mechanisms other than pure
valence-band electron tunneling may be involved. The contributions
from tunneling by hot holes generated at the polysilicon surface and
from tunneling by field-enhanced excitation of electrons from the Si
valence band [12] should be carefully analyzed in order to determine
the exact origin of this observed substrate current.

5.  Summary

We have investigated carrier-tunneling-related phenomena in
thin-oxide MOSFETs, namely, direct tunneling, lattice scattering and
field heating in the oxide, carrier multiplication in silicon, and
hole and valence-band electron tunneling. It is
found  that  direct  tunneling  imposes  a  fundamen¬ 
tal  limit  on  oxide-thickness  scaling.  The  average 
energy  of  the  electrons  in  the  oxide  conduction 
band  is  primarily  a  function  of  oxide  field  as  a 
result  of  strong  lattice  scattering.  An  impact  ioni¬ 
zation  threshold  of  ~1.7eV  is  observed  in  Si  and  a 
quantum  yield  less  than  2  is  measured  for  all  the 
samples.  Finally,  hole  tunneling  is  observed  in 
both  Si-gate  p-channel  and  n-channel  devices. 
Fig. 1a Energy-band diagram showing Fowler-Nordheim tunneling
(triangular barrier, V_ox > Φ_b) from the p-type substrate to the gate
of a MOS structure.


Fig. 1b Energy-band diagram showing direct tunneling (trapezoidal
barrier, V_ox < Φ_b). Electrons do not go into the conduction band of
the SiO2 in this process.
Fig. 2 Theoretical and experimental tunneling I-V's of Al-gate
n-channel MOS structures under negative gate bias, illustrating the
difference in I-V characteristics between F-N and direct tunneling.
Abstract

Electrical breakdown of thin (32 nm) SiO2 films subjected to
constant-current stressing is studied. By studying the effects of
reversing the polarity of the constant-current bias and the effects of
thermal annealing on the charge-to-breakdown, it is determined that
electrical breakdown of SiO2 is not caused by the widely cited
accumulation of trapped electrons. Rather, it is caused by the
build-up of positive charges near the cathode at localized areas. The
positive charges are not mobile ions but exhibit many characteristics
of trapped holes. We conclude that electrical breakdown in SiO2 is
caused by the accumulation of holes generated by impact ionization in
the oxide.


† On leave from Yale University.


1.  Introduction 


Time-dependent dielectric breakdown of SiO2 may be divided into two
stages. The first is a build-up stage during which localized
high-field/current-density regions are formed. The brief second stage
then begins, during which runaway electrical and/or thermal processes
quickly bring the oxide to breakdown. The time required to complete
the build-up stage, of course, determines the lifetime of the oxide.

Three basic physical models for the build-up stage are known to us
[1-5]. According to one model [1], positive impurity ions, such as
Na+, migrate to the cathode interface under the influence of the
field, resulting in an increased local field and reduced barrier
height at the cathode (Figure 1a). A second model [2-4] is similar to
the first except for postulating that the trapped positive charge
results from hole generation by impact ionization in the oxide
(Figure 1b). A third and widely cited model [5] suggests that electron
trapping causes an increase in the electric field at the anode (for a
fixed stressing voltage) (Figure 1c). Breakdown results when the field
reaches a critical value at which the Si-O bond is broken. However,
conclusive experimental verification for these different models has
been lacking.

This  paper  presents  new  experimental  evidence  which  supports  the  hole- 
build-up  model  and  contradicts  the  other  two  models. 

2.  Experimental  Results  and  Discussion 

The samples used in this study were polysilicon-gate MOS capacitors.
10-20 Ω-cm p-type substrates were used. The gate oxide was grown to a
thickness of approximately 32 nm at 900 °C in dry O2, followed by a
10-minute anneal in nitrogen. Polysilicon was then deposited and
implanted with arsenic. The arsenic was driven in at 1000 °C in dry O2
for 1 hour. The post-metallization anneal was performed at 450 °C for
15 minutes in forming gas.

The capacitors were biased with a constant current to accelerate
breakdown. This technique was first described by Harari [5]. The use
of a constant-current stress facilitates the measurement of the
"charge-to-breakdown" (the amount of charge passing through the oxide
before breakdown). There is no difference in the physics of oxide
breakdown caused by constant-voltage stress and by constant-current
stress. The current flow in SiO2 is due to Fowler-Nordheim tunneling,
which has an I-V characteristic given by

I = A E_c^2 \exp\left( -\frac{B}{E_c} \right)          (1)

where A and B depend on the electron effective mass and the barrier
height at the injecting interface, i.e., the cathode. E_c is the
cathode electric field.

A small fraction of the electrons injected into the SiO2 are
subsequently trapped. Because the constant-current bias maintains a
constant cathode field (Eq. 1), the gate voltage necessary to maintain
a constant current increases as the density of trapped electrons
increases.
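
A numerical sketch of this argument follows. It uses the textbook
Fowler-Nordheim expressions for A and B, with an assumed barrier
height of 3.2 eV and an assumed oxide effective mass of 0.5 m0;
neither value is stated in this paper, and the function names and the
trapped-charge numbers are hypothetical. All quantities are in SI
units.

```python
import math

Q_E = 1.602176634e-19             # elementary charge, C
H = 6.62607015e-34                # Planck constant, J s
HBAR = H / (2.0 * math.pi)
M0 = 9.1093837015e-31             # electron rest mass, kg
EPS_OX = 3.9 * 8.8541878128e-12   # SiO2 permittivity, F/m

PHI_B = 3.2 * Q_E                 # assumed barrier height, J
M_OX = 0.5 * M0                   # assumed oxide effective mass

A_FN = Q_E ** 3 / (8.0 * math.pi * H * PHI_B) * (M0 / M_OX)       # A/V^2
B_FN = 4.0 * math.sqrt(2.0 * M_OX) * PHI_B ** 1.5 / (3.0 * Q_E * HBAR)  # V/m


def fn_current_density(e_c):
    """F-N current density (A/m^2) at cathode field e_c (V/m)."""
    return A_FN * e_c ** 2 * math.exp(-B_FN / e_c)


def cathode_field_for_current(j_target, lo=1.0e8, hi=5.0e9):
    """Cathode field (V/m) that sustains j_target, found by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if fn_current_density(mid) < j_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gate_voltage_shift(n_trapped_m2, t_ox_m, centroid_from_cathode_m):
    """Extra gate voltage (V) needed to hold the cathode field fixed."""
    distance_to_anode = t_ox_m - centroid_from_cathode_m
    return Q_E * n_trapped_m2 * distance_to_anode / EPS_OX


# 33 mA/cm^2 equals 330 A/m^2; trapped density and centroid are hypothetical.
e_c = cathode_field_for_current(330.0)
dv_g = gate_voltage_shift(1.0e16, 32.0e-9, 16.0e-9)
```

The cathode field is fixed by the current alone, so trapped electrons
between the centroid and the anode raise the field there and show up
as an increase in the gate voltage.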

Figure 2 shows typical I-V characteristics of a device before and
after being subjected to a constant-current stress with the gate
biased negative. Figure 2a shows the I-V curves for positive V_g
(first quadrant in the I-V plane). After stressing, the I-V
characteristics are shifted to the right, which is indicative of
electron trapping in the oxide. Also shown in Figure 2a are the I-V
characteristics of the stressed device measured after annealing the
device in forming gas at 450 °C. As can be seen, the effect of the
anneal is to remove the negative charge from the oxide. The
post-anneal I-V characteristic is nearly identical to the original
one. In Figure 2b, the third-quadrant I-V characteristics (negative
V_g, I_g) show a dramatic reduction in slope due to stressing. The
450 °C anneal shifted the I-V to the left by removing the trapped
electrons, as expected. However, the slope remains reduced, and the
current is increased, or shifted to the left, when compared to the
original (before stressing) I-V. The direction of the shifts in the
I-V curves is consistent with positive charge trapping. Furthermore,
slope reduction and current increase in the Fowler-Nordheim
characteristics can be attributed to cathode barrier lowering due to
trapped positive charge near the cathode, i.e., the gate [6].
Therefore we conclude that positive charge accumulates near the gate,
i.e., the cathode, during and as a result of stressing. The positive
charging can be shown to be localized at a small fraction of the area
[7], where, perhaps, the hole trap density is higher. Hole trapping
near the anode (the substrate) is not sufficient to cause significant
barrier lowering, as the I-V slope is little changed after stressing
in Figure 2a. The effect of the annealing step was to remove the
trapped negative charge, while a large portion of the positive charge
remains trapped near the cathode.

The effect of this residual positive charge on the breakdown was then
studied. Devices were stressed to approximately 90% of the total
charge per unit area, Q = J × t, necessary for breakdown for that
particular polarity, where J is the current density and t is the
stressing time. A current density of 33 mA/cm2 was used in this
experiment. The devices were then annealed in forming gas. Annealing
temperatures of 350 °C and 450 °C were used. Then, the additional
charge Q_bd necessary for breakdown was measured. The results are
shown in Table 1. From rows 3 and 5, one observes that if the same
polarity current is employed after annealing, then only a similar
small amount of additional charge is necessary for breakdown, whether
the devices are annealed at either temperature or not annealed at all
after the initial stressing. Also, the total charge-to-breakdown,
Q_total = Q_i + Q_bd, is essentially the same in rows 3 and 2 and in
rows 5 and 1.


By  these  observations,  we  rule  out  electron  trapping  as  the  cause  of  the 
breakdown.  If  the  trapped  negative  charge  was  responsible  for  the  breakdown, 
then  one  would  expect,  in  row  3  or  5,  that  the  Qbd  after  annealing  would  be 
much  larger  than  the  Qbd  without  annealing  since  annealing  had  removed  the 
electrons  trapped  during  the  initial  stress  and  restored  the  oxide  to  the  original 
condition.  We  conclude  that  oxide  breakdown  is  due  to  a  build-up  of  positive 
charge  which  is  not  annealed  out  at  temperatures  as  high  as  450C  (see  Fig.  2b). 
In  fact,  hardly  any  positive  charge  is  detrapped  at  450C  as  evidenced  by  the 
insensitivity  of  Qbd  to  annealing  in  all  the  rows.  Table  1  also  shows  that  there 
exists  an  order  of  magnitude  increase  in  Qbd  when  the  final  stressing  polarity  is 
reversed  from  the  initial  stressing  polarity  (row  4  vs.  row  3,  and  6  vs.  5).  The 
higher  Qbd  due  to  the  reversal  of  the  polarity  of  the  stress  (without  the  thermal 
anneal) is more clearly illustrated in Figure 3. Without the polarity reversal,
Qtotal = Qi + Qbd = constant, as expected. With the polarity reversal, Qbd is
essentially independent of Qi. This supports the model that positive charge
builds up near the cathode during stress. The final Qbd is limited by the
build-up of positive charge near the final cathode during the final stress and
is insensitive to the positive charge trapped near the initial cathode during
the initial stress. This result suggests that extrapolation of device lifetime
based on charge-to-breakdown tests with a single-polarity current stress is
slightly pessimistic.

Another point worth noting is the significant difference in the Qbd observed
when one compares injection from the polysilicon-gate electrode with injection
from the silicon substrate. Injection from the substrate results in a
significantly larger Qbd, as shown in Figure 3 and Table 1. Similar results have
been reported previously [8]. This effect could be explained in terms of a more
efficient trapping of the generated positive charge near the polysilicon-SiO2
interface than near the Si-SiO2 interface, or in terms of a weaker
polysilicon-SiO2 interface such that the final runaway stage begins at a lower
density of trapped positive charge. In fact, the well-known localized nature of
oxide breakdown (the weak spots) can be understood in these terms, too.

In order to rule out mobile ion contamination as the cause of the breakdown,
several devices were stressed to 14 C/cm² in one polarity, an additional
14 C/cm² in the opposite polarity, and then stressed to breakdown in the
original polarity. The final Qbd was always of the order of 1 C/cm². The fact
that the device breaks down quickly in the third stress suggests that the
positive charge that is trapped at the original cathode basically remains
trapped there. If mobile ions were the positive charge responsible for the
breakdown, then one would expect a larger Qbd at the third step because the ions
accumulated during the first stress would have been moved away (from the
cathode) by the second stress. Certainly the presence of mobile ions such as Na+
in large amounts in the oxide will accelerate the breakdown process, as has been
shown several times [1,9-11]. What this experiment shows is that the minute
concentrations of Na+ present in today's MOS processes are not responsible for
oxide breakdown.


behavior of the trapped positive charge to be similar to that of
radiation-induced positive charge [13]. As is well known, ionizing radiation
results in the formation of hole-electron pairs in the SiO2, with the created
holes being trapped at the interfaces with a fairly high probability. Thus this
similarity in behavior of positive charge generated by ionizing radiation and
high-field stressing offers further support for impact ionization in the SiO2.
For example, Figure 4 shows some typical results of the annealing of the damage
caused by the constant-current injection. The quasi-static CV measurements of a
device stressed to +18 C/cm² and then successively annealed at temperatures of
350C and 450C are compared with the characteristics of the device before stress.
After the first anneal, the CV characteristics are shifted to the left,
indicating the presence of residual positive charge. In addition, fast surface
states are present. After annealing at 450C, the positive charge density is
reduced and the surface states are removed. This type of behavior is consistent
with reports of the annealing of radiation-induced trapped holes at the SiO2
interface [13-15].

3.  Conclusion 

Experimental evidence has been presented to rule out two previous models
of breakdown in SiO2, the electron-trapping model and the mobile-ion model.
The breakdown is due to a build-up of positive charge at the cathode interface
in localized areas. This positive charge is probably holes generated by impact
ionization in the oxide that drift to the cathode to be trapped. The trapped
positive charge increases the local field and current density (through barrier
lowering) and further accelerates the build-up until a very brief run-away
process brings the oxide to destructive breakdown.

5. Figure Captions
Fig.  1) 

The energy band diagram in the SiO2 when:

a) Na+ ions are trapped near the cathode.

b) holes are trapped near the cathode.

c) a sheet charge of electrons is trapped in the middle of the oxide.

Fig. 2)

IV characteristics of a device stressed to -18.2 C/cm² (gate injection, i.e.,
gate negative).

a)  Substrate-injection  (gate  positive)  IV  characteristic. 

b)  Gate-injection  (gate  negative)  IV  characteristic. 

Table  I. 

Additional charge per unit area necessary for breakdown, Qbd, as a function
of the initial charge per unit area passed through the oxide, Qi, and the
annealing temperature. - refers to electron injection from the polysilicon
gate, and + refers to injection from the silicon substrate. (J = 33 mA/cm².)

Fig. 3)

Additional charge per unit area Qbd necessary for breakdown versus the initial
charge per unit area Qi passed through the oxide. Negative refers to
electron injection from the polysilicon gate, and positive refers to injection
from the silicon substrate. The current density was 33 mA/cm².

Fig. 4)

Quasi-static CV characteristics before constant-current stressing (1) and
after both constant-current stressing and annealing (2 - 350C, 3 - 450C).
Q = +18 C/cm² (substrate injection).
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Modeling  the  Switch-Induced  Error  Voltage  on  a 
Switched-Capacitor 


BING  J.  SHEU  and  CHEN  MING  HU 


Abstract: An analytical model for switch-induced error voltage on a
switched capacitor is derived. A compact expression contains the effects of
gate voltage falling rate, threshold voltage, and storage capacitance. It can
be used to quickly predict the error voltage. The model is in good
agreement with computer simulations using the SPICE program and experiment.


List of Symbols

$\beta$       Conductance coefficient.
$C_L$         Storage capacitance.
$C_{ol}$      Gate-drain overlap capacitance.
$C_{ox}$      Gate capacitance (excluding overlap capacitance).
$C_o$         Gate capacitance per unit area.
$G$           Channel conductance.
$L$           Effective channel length.
$L_d$         Lateral diffusion distance.
$N_{sub}$     Substrate doping.
$t_{ox}$      Gate oxide thickness.
$U$           Gate voltage falling rate.
$V_G$         Gate voltage.
$V_H$         High value of $V_G$.
$V_L$         Low value of $V_G$.
$V_S$         Signal voltage at the source.
$V_{TO}$      Zero-bias threshold voltage.
$V_T$         Threshold voltage with back-gate bias.
$v_d$         Error voltage at drain end at time $t$.
$v_{dm}$      Error voltage at drain end after gate voltage reaches $V_L$.
$|v_{dm}|$    Absolute value of $v_{dm}$.
$W$           Channel width.
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In  the  design  of  precision  MOS  analog  circuits,  particularly 
switched-capacitor circuits, it is necessary to take into account
the switch-induced error voltage on a switched capacitor [1]. An
MOS transistor holds mobile charges in its channel when it is on.
When the transistor turns off, some portion of the mobile charges
is transferred to the storage capacitor and causes an error in the
sampled voltage (see Fig. 1). The clock voltage feedthrough
through the gate-drain overlap capacitance also contributes to
the error. The turnoff of an MOS switch consists of two distinct
phases. During the first phase, the transistor is on and a conduction
channel extends from the source to the drain of the transistor.
As the gate voltage falls, mobile charges exit through both
the source end and the drain end. In the presence of error voltage
there is also a conduction current flowing through the channel
between the source and the drain ends. When the gate voltage
reaches the threshold voltage, the conduction channel disappears
(subthreshold conduction is not included in our model), and the
transistor enters the second phase of turnoff. During this phase,
only the clock feedthrough through the gate-drain overlap
capacitance continues to raise the error voltage. Compensation schemes
[2], [3] have been used to reduce the error voltage. However, no
analytical expression is available in the literature for quick
prediction of the error voltage and the guidance of circuit design.

Fig. 1. Schematic of the switch circuit under study.

In this paper, an analytical expression for the switch-induced
error voltage is derived. Computer simulations and experiment
are used to support the result.


II.  Analytical  Model  of  Error  Voltage 

We assume that the charge pumping phenomenon [4] due to the
capture of channel charges by the interface traps is not significant.
In other words, when the transistor turns off, all the channel
mobile charges exit through the source and drain ends. The
circuit schematic to be analyzed is shown in Fig. 1 (an nMOS switch
is used for illustration). The source end of the switch is connected
to a signal voltage source with value $V_S$, and the drain end is
connected to a storage capacitor with capacitance $C_L$. The
equivalent lumped models for the circuit during the first and second
phases of turnoff are shown in Fig. 2(a) and 2(b), respectively.
This lumped model was derived from an exact analysis of the
distributed MOSFET model [5].

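Fig. 2. Equivalent lumped models of the switch circuit for (a) $V_H > V_G > V_S + V_T$ and (b) $V_S + V_T > V_G > V_L$.

From KCL at the drain node,

$$C_L \frac{dv_d}{dt} = -i_d + \left(C_{ol} + \frac{C_{ox}}{2}\right) \frac{d(V_G - v_d)}{dt}. \tag{1}$$

We assume that the gate voltage is a ramp function which begins
to fall at time 0 from the high value $V_H$ toward the low value $V_L$
at a falling rate $U$:

$$V_G = V_H - Ut. \tag{2}$$

Under the condition $|dV_G/dt| \gg |dv_d/dt|$, (1) simplifies to

$$C_L \frac{dv_d}{dt} = -i_d - \left(C_{ol} + \frac{C_{ox}}{2}\right) U. \tag{3}$$

When the transistor is operated in the strong inversion region
($V_H > V_G > V_S + V_T$),

$$i_d = G v_d = \beta (V_{HT} - Ut)\, v_d \tag{4}$$

where $\beta = \mu C_o W / L$ and $V_{HT} = V_H - V_S - V_T$, so that (3) becomes

$$C_L \frac{dv_d}{dt} = -\beta (V_{HT} - Ut)\, v_d - \left(C_{ol} + \frac{C_{ox}}{2}\right) U. \tag{5}$$

The solution of the differential equation is

$$v_d(t) = -\sqrt{\frac{\pi U C_L}{2\beta}} \, \frac{C_{ol} + C_{ox}/2}{C_L} \, \exp\!\left[\frac{\beta U}{2 C_L}\left(t - \frac{V_{HT}}{U}\right)^{2}\right] \left\{ \operatorname{erf}\!\left[\sqrt{\frac{\beta}{2 U C_L}}\, V_{HT}\right] - \operatorname{erf}\!\left[\sqrt{\frac{\beta}{2 U C_L}}\, (V_{HT} - Ut)\right] \right\}. \tag{6}$$

At $t' = V_{HT}/U$, the threshold condition is reached ($V_G = V_S + V_T$)
and the first phase ends. After that only the gate-drain overlap
capacitor continues to contribute to the error voltage. The error
voltage at this time is

$$v_d(t') = -\sqrt{\frac{\pi U C_L}{2\beta}} \, \frac{C_{ol} + C_{ox}/2}{C_L} \, \operatorname{erf}\!\left[\sqrt{\frac{\beta}{2 U C_L}}\, V_{HT}\right]. \tag{7}$$

When the gate voltage reaches its final value $V_L$, the total amount
of switch-induced error voltage on a switched capacitor is

$$v_{dm} = -\sqrt{\frac{\pi U C_L}{2\beta}} \, \frac{C_{ol} + C_{ox}/2}{C_L} \, \operatorname{erf}\!\left[\sqrt{\frac{\beta}{2 U C_L}}\, V_{HT}\right] - \frac{C_{ol}}{C_L}\,(V_S + V_T - V_L). \tag{8}$$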
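
As a rough numerical cross-check of this derivation, the sketch below integrates the first-phase differential equation (5) with a fixed-step fourth-order Runge-Kutta scheme and compares it with the closed-form solution (6). It is an illustration under stated assumptions, not the authors' procedure: $\beta$, $C_L$, $V_S$, and $V_{TO}$ follow the Fig. 3 caption, while $V_H = 5$ V and the lumped capacitance $C_{ol} + C_{ox}/2$ are assumed values chosen only for illustration, and all function names are hypothetical.

```python
import math

# SI units. BETA, C_L follow the Fig. 3 caption; V_H = 5 V and C_EFF are assumptions.
BETA = 30.3e-6            # A/V^2, conductance coefficient
C_L = 2e-12               # F, storage capacitance
C_EFF = 3.94e-15          # F, assumed value of C_ol + C_ox/2
U = 1e8                   # V/s, gate voltage falling rate (0.1 V/ns)
V_HT = 5.0 - 0.0 - 0.6    # V, V_H - V_S - V_T with V_H = 5 V assumed
T_PRIME = V_HT / U        # s, time at which the threshold condition is reached

def vd_closed_form(t):
    """Expression (6): first-phase error voltage, valid for 0 <= t <= T_PRIME."""
    k = math.sqrt(BETA / (2.0 * U * C_L))
    prefactor = math.sqrt(math.pi * U * C_L / (2.0 * BETA)) * C_EFF / C_L
    growth = math.exp(BETA * U / (2.0 * C_L) * (t - T_PRIME) ** 2)
    return -prefactor * growth * (math.erf(k * V_HT) - math.erf(k * (V_HT - U * t)))

def vd_rk4(t_end, steps=2000):
    """Integrate expression (5) from v_d(0) = 0 with classical RK4."""
    def rhs(t, v):
        return (-BETA * (V_HT - U * t) * v - C_EFF * U) / C_L
    h = t_end / steps
    t, v = 0.0, 0.0
    for _ in range(steps):
        k1 = rhs(t, v)
        k2 = rhs(t + h / 2, v + h * k1 / 2)
        k3 = rhs(t + h / 2, v + h * k2 / 2)
        k4 = rhs(t + h, v + h * k3)
        v += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return v

for t in (T_PRIME / 2, T_PRIME):
    print(f"t = {t * 1e9:5.1f} ns  closed form = {vd_closed_form(t) * 1e3:.4f} mV  "
          f"RK4 = {vd_rk4(t) * 1e3:.4f} mV")
```

The exponential factor in (6) grows quickly for very slow ramps, so this direct evaluation is only intended for falling rates where $\beta V_{HT}^2 / (2 U C_L)$ stays moderate; at $t = t'$ the factor equals one and (6) reduces to (7).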


Fig. 3. Comparison of the analytic and computer simulated transient
responses for three different gate voltage falling rates. The parameters for
Figs. 3 and 4 are $V_S = 0$, $C_L = 2$ pF, $V_{TO} = 0.60$ V, $W = 4\ \mu$m,
$L = 3.3\ \mu$m, $L_d = 0.35\ \mu$m, and $\beta = 30.3\ \mu$A·V$^{-2}$.
(Legend: analytical; simulation (SPICE 2).)

Fig. 4. Comparison of the analytic and computer simulated results of the
error voltage as a function of the gate voltage falling rate.
(Horizontal axis: falling rate, V/s.)

III.  Comparison  with  Computer  Simulation 

To validate the model, computer simulations using the SPICE
2G [6], [7] circuit-simulation program have been performed. The
circuit configuration for computer simulations is the same as that
of Fig. 1.

The analytical transient response, (6), and computer simulated
results for three different gate voltage falling rates (0.1 V/ns, 0.2
V/ns, and 0.5 V/ns) are shown in Fig. 3. The close agreement
between the analysis and the computer simulation is
evident. Another comparison is shown in Fig. 4. The error voltage
is plotted against the gate voltage falling rate for both the
analytical and simulation results. Fig. 5 shows the measured data
and calculated result from the analytical model (8). Good
agreement is found.

IV.  Conclusion 

An analytical expression for the switch-induced error voltage
on a switched capacitor is presented. Computer simulation and
experiment support the validity of the analysis. The compact
expression (8) should be convenient in the analysis of
switched-capacitor circuits, such as A/D and D/A converters and filters.

Fig. 5. Measured and calculated error voltages as functions of gate voltage
falling rate (horizontal axis: falling rate, V/s). The parameters are
$V_S = 0$, $C_L = 24.5$ pF, effective $C_{ol} = 195$ fF (including parasitic
probe capacitance), $V_{TO} = 0.70$ V, $W = 40\ \mu$m, $L = 5.1\ \mu$m,
$L_d = 0.45\ \mu$m, $t_{ox} = 85$ nm, and $\beta = 295\ \mu$A·V$^{-2}$.
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ABSTRACT 


A concise analytical expression for switch-induced error voltage
on a switched capacitor is derived from the distributed MOSFET
model. The result, however, can be interpreted in terms of a
simple lumped equivalent circuit. With this expression we explore
the dependence of the error voltage on process, switch turn-off
rate, source resistance, and other circuit parameters. These
results can be used to quickly predict the error voltage. The
analytical expression is supported by the close agreement with
computer simulations and experiments.

List of Symbols

$\beta$       Conductance coefficient.
$C_L$         Storage capacitance.
$C_{ol}$      Gate-drain overlap capacitance.
$C_{ox}$      Gate capacitance (excluding overlap capacitance).
$C_o$         Gate capacitance per unit area.
$G$           Channel conductance.
$L$           Effective channel length ($L_{drawn} - 2L_d$).
$L_d$         Lateral diffusion distance.
$N_{sub}$     Substrate doping.
$R_S$         Source resistance of the signal voltage source.
$t_{ox}$      Gate oxide thickness.
$U$           Gate voltage falling rate.
$V_G$         Gate voltage.
$V_H$         High value of $V_G$.
$V_L$         Low value of $V_G$.
$V_S$         Signal voltage at the source.
$V_T$         Threshold voltage with back-gate bias.
$V_{TO}$      Zero-bias threshold voltage.
$v_d$         Error voltage at drain end at time $t$.
$v_{dm}$      Error voltage at drain end after gate voltage reaches $V_L$.
$|v_{dm}|$    Absolute value of $v_{dm}$.
$W$           Channel width.

I. Introduction


The error voltage induced by the turning-off of an MOS switch is one of the
fundamental factors that limit the accuracy of switched-capacitor circuits, such
as A/D and D/A converters and filters [1]. An MOS transistor holds mobile
charges in its channel when it is on. When the transistor turns off, some
portion of the mobile charges is transferred to the storage capacitor and causes
an error in the sampled voltage (see Fig. 1). The clock voltage feedthrough
through the gate-drain overlap capacitance also contributes to the error. The
turn-off of an MOS switch consists of two distinct phases. During the first
phase, the transistor is on and a conduction channel extends from the source to
the drain of the transistor. As the gate voltage falls, mobile charges exit
through both the source end and the drain end. When the gate voltage reaches the
threshold voltage, the conduction channel disappears (subthreshold conduction is
not included in our analysis), and the transistor enters the second phase of
turn-off. During this phase, only the clock feedthrough through the gate-drain
overlap capacitance continues to increase the error voltage. The switch-induced
error voltage on a switched capacitor can be reduced by turning off the switch
very slowly, to allow charges to return to the source end, and by minimizing the
part of the gate voltage swing that is below the threshold voltage, to minimize
the effect of the gate-drain overlap capacitance. Compensation schemes [2], [3]
have been used to reduce the switch-induced error voltage. However, little work
has been done on the analysis of this phenomenon.

In this paper, we analyze the switching-off behavior of the MOS switch. An
analytical expression for the switch-induced error voltage is derived. Using
this expression we explore the dependence of the error voltage on process and
gate voltage falling rate. These results can be used to quickly predict the
error voltage. Finally, computer simulations and experiments are used to
validate the analysis. The derivation of the lumped model from the distributed
model is attached as Appendix A.


II. Analytical Model of Error Voltage

We assume that the charge pumping phenomenon [4] due to the capture of
channel charges by the interface traps is not significant. In other words, when
the transistor turns off, all the channel mobile charges exit through the source
and drain ends. The circuit schematic to be analyzed is shown in Fig. 1 (an NMOS
switch is used for illustration). The source end of the switch is connected to a
signal voltage source with value $V_S$, and the drain end is connected to a
storage capacitor with capacitance $C_L$. The equivalent lumped models for the
circuit during the first and second phases of turn-off are shown in Fig. 2. The
configuration of the lumped model is not arbitrarily chosen but results from an
exact analysis of the distributed MOSFET model, as shown in Appendix A.

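From KCL at the drain node,

$$C_L \frac{dv_d}{dt} = -i_d + \left(C_{ol} + \frac{C_{ox}}{2}\right) \frac{d(V_G - v_d)}{dt}. \tag{1}$$

We assume that the gate voltage is a ramp function which begins to fall at time 0
from the high value $V_H$ toward the low value $V_L$ at a falling rate $U$:

$$V_G = V_H - Ut. \tag{2}$$

Under the condition $|dV_G/dt| \gg |dv_d/dt|$ (*), (1) simplifies to

$$C_L \frac{dv_d}{dt} = -i_d - \left(C_{ol} + \frac{C_{ox}}{2}\right) U. \tag{3}$$

(*) This assumption is also needed in deriving the lumped model. See Appendix A.

When the transistor is operated in the strong inversion region
($V_H \ge V_G \ge V_S + V_T$),

$$i_d = G v_d = \beta (V_{HT} - Ut)\, v_d \tag{4}$$

where $\beta = \mu C_o W / L$ and $V_{HT} = V_H - V_S - V_T$, so that (3) becomes

$$C_L \frac{dv_d}{dt} = -\beta (V_{HT} - Ut)\, v_d - \left(C_{ol} + \frac{C_{ox}}{2}\right) U. \tag{5}$$

The solution of the differential equation is

$$v_d(t) = -\sqrt{\frac{\pi U C_L}{2\beta}} \, \frac{C_{ol} + C_{ox}/2}{C_L} \, \exp\!\left[\frac{\beta U}{2 C_L}\left(t - \frac{V_{HT}}{U}\right)^{2}\right] \left\{ \operatorname{erf}\!\left[\sqrt{\frac{\beta}{2 U C_L}}\, V_{HT}\right] - \operatorname{erf}\!\left[\sqrt{\frac{\beta}{2 U C_L}}\, (V_{HT} - Ut)\right] \right\}. \tag{6}$$

At $t' = V_{HT}/U$, the threshold condition is reached ($V_G = V_S + V_T$) and the
first phase ends. After that only the gate-drain overlap capacitor continues to
contribute to the error voltage. The error voltage at this time is

$$v_d(t') = -\sqrt{\frac{\pi U C_L}{2\beta}} \, \frac{C_{ol} + C_{ox}/2}{C_L} \, \operatorname{erf}\!\left[\sqrt{\frac{\beta}{2 U C_L}}\, V_{HT}\right]. \tag{7}$$

When the gate voltage reaches its final value $V_L$, the total amount of
switch-induced error voltage on a switched capacitor is

$$v_{dm} = -\sqrt{\frac{\pi U C_L}{2\beta}} \, \frac{C_{ol} + C_{ox}/2}{C_L} \, \operatorname{erf}\!\left[\sqrt{\frac{\beta}{2 U C_L}}\, V_{HT}\right] - \frac{C_{ol}}{C_L}\,(V_S + V_T - V_L). \tag{8}$$

It is well known that

$$\operatorname{erf}(x) \approx \begin{cases} 1, & \text{if } x \gg 1 \\[4pt] \dfrac{2}{\sqrt{\pi}}\left(x - \dfrac{x^{3}}{3}\right), & \text{if } x \ll 1 \end{cases}$$

therefore, expression (8) can be simplified in the two extreme cases.

For slow switching-off, $\beta V_{HT}^{2} / (2 C_L) \gg U$,

$$v_{dm} = -\sqrt{\frac{\pi U C_L}{2\beta}} \, \frac{C_{ol} + C_{ox}/2}{C_L} - \frac{C_{ol}}{C_L}\,(V_S + V_T - V_L). \tag{9}$$

For fast switching-off, $\beta V_{HT}^{2} / (2 C_L) \ll U$,

$$v_{dm} = -\frac{C_{ol} + C_{ox}/2}{C_L} \left( V_{HT} - \frac{\beta V_{HT}^{3}}{6 U C_L} \right) - \frac{C_{ol}}{C_L}\,(V_S + V_T - V_L). \tag{10}$$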

III. Dependence on Process and Electrical Parameters

The switch-induced error voltage on a switched capacitor is affected by
many factors. The factors of the greatest interest are the gate voltage falling
rate, signal voltage level, substrate doping, oxide thickness, transistor size,
and source resistance of the signal voltage source. Common parameter values used
in the following examples are: $W = 4\ \mu$m, $L = 3.3\ \mu$m,
$L_d = 0.35\ \mu$m, $C_L = 2$ pF, $t_{ox} = 70$ nm,
$N_{sub} = 5.0 \times 10^{14}$ cm$^{-3}$, $V_{TO} = 0.6$ V, $V_H = 5$ V,
$V_L = 0$ V, $U = 1$ V·ns$^{-1}$, and $\mu C_o = 25 \times 10^{-6}$ A·V$^{-2}$.
These are typical values found in state-of-the-art circuit designs. From
expression (8) we notice that the switch-induced error voltage of an NMOS switch
is negative. In the following discussion we will focus on the absolute value of
this error voltage and denote it as $|v_{dm}|$.

A. Dependence on Gate Voltage Falling Rate

Error voltage $|v_{dm}|$ is plotted against the gate voltage falling rate,
ranging from $10^{4}$ V·s$^{-1}$ to $10^{11}$ V·s$^{-1}$, for four signal
voltage levels (0 V, 1 V, 2 V, and 3 V) in Fig. 3. In the case of slow falling
rate, most of the channel charges return to the source when the switch is on,
and the error voltage is primarily due to the clock feedthrough of the
gate-drain overlap capacitance after the switch is turned off. At very slow
falling rate, $|v_{dm}|$ saturates at $(V_S + V_T - V_L)\,C_{ol}/C_L$, as can be
seen from expression (9). In the case of fast falling rate, nearly one half of
the channel charges are deposited in the storage capacitor and $|v_{dm}|$
saturates at $V_{HT}(C_{ol} + C_{ox}/2)/C_L + (V_S + V_T - V_L)\,C_{ol}/C_L$, as
can be seen from expression (10). From Fig. 3, it is obvious that the dependence
of $|v_{dm}|$ on $V_S$ may be minimized by judiciously choosing the falling rate.
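
The two saturation levels described above can be reproduced with a short sweep of expression (8) over the same range of falling rates. This is a sketch under assumptions rather than a reproduction of Fig. 3: only $V_S = 0$ is evaluated, the body effect is ignored, and the capacitances $C_{ol}$ and $C_{ox}$ are assumed estimates for a 70 nm oxide; the names are hypothetical.

```python
import math

# SI units; C_OL and C_OX are assumed estimates for illustration.
BETA, C_L, C_OL, C_OX = 30.3e-6, 2e-12, 0.69e-15, 6.5e-15
V_H, V_L, V_S, V_T = 5.0, 0.0, 0.0, 0.6
V_HT = V_H - V_S - V_T

def abs_v_dm(u):
    """|v_dm| from expression (8) at gate voltage falling rate u (V/s)."""
    x = V_HT * math.sqrt(BETA / (2.0 * u * C_L))
    first = math.sqrt(math.pi * u * C_L / (2.0 * BETA)) * (C_OL + C_OX / 2.0) / C_L * math.erf(x)
    return first + C_OL / C_L * (V_S + V_T - V_L)

# Plateaus read off from expressions (9) and (10).
slow_plateau = (V_S + V_T - V_L) * C_OL / C_L
fast_plateau = V_HT * (C_OL + C_OX / 2.0) / C_L + slow_plateau

rates = [10.0 ** k for k in range(4, 12)]     # 1e4 ... 1e11 V/s
errors = [abs_v_dm(u) for u in rates]

print(f"slow plateau = {slow_plateau * 1e3:.3f} mV, fast plateau = {fast_plateau * 1e3:.3f} mV")
for u, e in zip(rates, errors):
    print(f"U = {u:.0e} V/s  |v_dm| = {e * 1e3:.3f} mV")
```

In this sketch the error rises monotonically from just above the slow plateau to just below the fast plateau, which is the S-shaped behavior the text attributes to Fig. 3.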

B.  Dependence  on  Signal  Voltage  Level  and  Substrate  Doping 

Error voltage $|v_{dm}|$ is plotted against signal voltage $V_S$ for five
substrate dopings in Fig. 4. As the substrate doping increases, the body effect
increases accordingly, which in turn causes the threshold voltage in expression
(8) to become more sensitive to $V_S$. The argument of the error function in
expression (8) is smaller for larger $V_S$ or heavier substrate doping. As long
as the first term in (8) dominates, e.g. at fast falling rates, $|v_{dm}|$ is a
strong function of $V_S$, and more so in circuits with heavy substrate doping.

C.  Dependence  on  Oxide  Thickness 

Advances in silicon technologies continue to make smaller MOS device
dimensions possible. As the device size shrinks, the oxide thickness reduces,
too. If storage capacitor oxide and gate oxide are scaled by the same factor,
$(C_{ol} + C_{ox}/2)/C_L$ remains constant. The effect of the $C_L$ increase due
to oxide reduction is exactly balanced by the effect of the $\beta$ increase
(assuming constant $W/L$) such that the square root term and the error function
term in expression (8) are unaltered. Hence, error voltage $|v_{dm}|$ is not
affected.
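
The cancellation argued above can be checked numerically. The sketch below is an illustration under assumptions: the baseline capacitances are assumed estimates, the threshold voltage is held fixed while the oxide is scaled, and the names are hypothetical. Scaling every oxide capacitance and $\beta$ by the same factor leaves expression (8) unchanged, whereas scaling only the gate oxide (with $C_L$ fixed) does not.

```python
import math

def abs_v_dm(beta, c_l, c_ol, c_ox, u=1e9, v_ht=4.4, v_off=0.6):
    """|v_dm| from expression (8); v_off stands for V_S + V_T - V_L (SI units)."""
    x = v_ht * math.sqrt(beta / (2.0 * u * c_l))
    first = math.sqrt(math.pi * u * c_l / (2.0 * beta)) * (c_ol + c_ox / 2.0) / c_l * math.erf(x)
    return first + c_ol / c_l * v_off

# Baseline values; c_ol and c_ox are assumed estimates for a 70 nm oxide.
base = dict(beta=30.3e-6, c_l=2e-12, c_ol=0.69e-15, c_ox=6.5e-15)

k = 1.5  # oxide thinned by 1/k, so every oxide capacitance and beta grow by k
both_scaled = {name: k * value for name, value in base.items()}
gate_only = dict(base, beta=k * base["beta"], c_ol=k * base["c_ol"], c_ox=k * base["c_ox"])

e_base = abs_v_dm(**base)
e_both = abs_v_dm(**both_scaled)
e_gate = abs_v_dm(**gate_only)

print(f"baseline            : {e_base * 1e3:.4f} mV")
print(f"gate + storage oxide: {e_both * 1e3:.4f} mV")
print(f"gate oxide only     : {e_gate * 1e3:.4f} mV")
```

The second case shows why the conclusion depends on the storage capacitor being built with the same oxide: if $C_L$ does not scale, the thinner gate oxide injects more charge onto an unchanged capacitance.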

D.  Dependence  on  Channel  Width  and  Length 

Transistor size is one of the most important variables in circuit design.
Designers have to choose the appropriate combination of transistor sizes in
order to achieve optimum circuit performance. Error voltage $|v_{dm}|$ is
plotted against channel length, ranging from 1 $\mu$m to 10 $\mu$m, for four
different channel widths (1 $\mu$m, 4 $\mu$m, 7 $\mu$m, and 10 $\mu$m) in
Fig. 5. It is clear that a smaller transistor size introduces a smaller error
voltage with the set of typical circuit parameter values listed at the beginning
of this section.
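
A small grid evaluation of expression (8) illustrates this size dependence. The sketch rests on assumptions that are not stated in the paper: the gate capacitance per unit area is estimated from $t_{ox} = 70$ nm with the textbook SiO2 permittivity, the length argument is treated as the effective channel length, $C_{ox} = C_o W L$ and $C_{ol} = C_o W L_d$ are used for the capacitances, and the body effect is ignored. All names are hypothetical.

```python
import math

EPS_OX = 3.9 * 8.854e-12      # F/m, SiO2 permittivity (textbook value, an assumption here)
C_O = EPS_OX / 70e-9          # F/m^2, gate capacitance per unit area for t_ox = 70 nm
MU_C_O = 25e-6                # A/V^2
L_D = 0.35e-6                 # m, lateral diffusion distance
C_L = 2e-12                   # F
U = 1e9                       # V/s
V_H, V_L, V_S, V_T = 5.0, 0.0, 0.0, 0.6

def abs_v_dm(w, l):
    """|v_dm| from expression (8) for channel width w and effective length l, in metres."""
    beta = MU_C_O * w / l
    c_ox = C_O * w * l
    c_ol = C_O * w * L_D
    v_ht = V_H - V_S - V_T
    x = v_ht * math.sqrt(beta / (2.0 * U * C_L))
    first = math.sqrt(math.pi * U * C_L / (2.0 * beta)) * (c_ol + c_ox / 2.0) / C_L * math.erf(x)
    return first + c_ol / C_L * (V_S + V_T - V_L)

widths = [1e-6, 4e-6, 7e-6, 10e-6]
lengths = [1e-6, 3.3e-6, 10e-6]
for w in widths:
    row = "  ".join(f"{abs_v_dm(w, l) * 1e3:8.3f}" for l in lengths)
    print(f"W = {w * 1e6:4.1f} um : {row}  (mV, for L = 1, 3.3, 10 um)")
```

Under these assumptions the error grows with both width and length, in line with the statement that the smallest switch introduces the smallest error.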

E. Effect of Source Impedance

Source impedance of the signal voltage affects the error voltage. The circuit
schematic which includes a source resistance is shown in Fig. 6. Derivation
of the analytical model including source resistance is quite similar to that
without source resistance and is attached as Appendix B. Error voltage is
plotted against source resistance in Fig. 7. As the source resistance increases,
fewer channel charges return to the source end of the transistor and the error
voltage becomes larger.

IV.  Comparison  with  Computer  Simulation 

To validate the model, computer simulations using the SPICE 2G [5], [6]
circuit-simulation program have been performed. The circuit configuration for
computer simulations is the same as that of Fig. 1. One example of a SPICE input
file is as follows:

SIMULATION OF SWITCH-INDUCED ERROR VOLTAGE ON A SWITCHED CAPACITOR
* XQC<0.5 FOR CHARGE CONTROLLED MODEL
M1 2 1 3 0 MODN W=4U L=4U
CSTORAGE 2 0 2P
.MODEL MODN NMOS LEVEL=2 TOX=70N NSUB=5E14 KP=25U LD=0.35U
+VTO=0.6 VMAX=5E4 CGSO=1.7255E-10 CGDO=1.7255E-10 XQC=0.4999
VG 1 0 PWL(0 5 10N 5 60N 0)
VS 3 0 DC 0
.OPTIONS ABSTOL=1E-14 CHGTOL=1E-16 RELTOL=1E-5
.TRAN 1N 65N
.PRINT TRAN V(2) V(1)
.END

The analytical transient response, expression (6), and computer simulated
results for three different gate voltage falling rates (0.1 V/ns, 0.2 V/ns, and
0.5 V/ns) are shown in Fig. 8. The close correspondence between the analysis
and the computer simulation is evident. The error voltage is plotted against the
gate voltage falling rate for both the analytical and simulation results in
Fig. 9. The agreement is excellent. The computer simulated result for the
circuit schematic of Fig. 6 is shown in Fig. 7 together with the analytical
result. They match very well. Two other curves of simulated results in Fig. 7
correspond to the case where a source capacitance exists in parallel with the
source resistance. The source capacitor compensates the effect of the source
resistance and inhibits the error voltage from increasing too much. The larger
the capacitance, the stronger the compensation.

The charge-controlled model is used in the SPICE simulation by assigning the XQC
parameter a value smaller than 0.5 so that charge conservation is retained. If
the XQC parameter is assigned a value greater than or equal to 0.5, Meyer's
capacitance model is automatically employed and charge conservation is not
guaranteed [7]. Meyer's capacitance model is implemented in the SPICE MOS
level-2 model in such a manner as to improve the convergence of circuit
simulation [6]. However, it might introduce a small error in transient
simulations that are very sensitive to the capacitive currents of the
transistors. The error introduced is insignificant for most circuit simulations
but may be quite serious in simulating switched-capacitor circuits. One curve
corresponding to simulated results using SPICE MOS level-2 and Meyer's
capacitance model is also shown in Fig. 9. SPICE simulation with Meyer's
capacitance model generates an error of a fraction of a millivolt. The
analytical model presented in this paper has no hidden error to the extent that
its underlying assumptions, which are easily identified, are valid.


V.  Experimental  Results 

Experimental transistors for the MOS switch were designed and fabricated
using a local oxidation polysilicon gate CMOS process. The transistor
parameters are listed in Table I. The stray capacitance between the probes of an
on-wafer testing station is quite large when the probes are close to each other.
We put the transistors inside a 24-pin package to reduce such inter-probe
capacitance.

A precision capacitance meter was employed to determine the effective storage
capacitance, $C_L$, existing at the drain end. The measured value was 24.5 pF.
Fig. 10 shows a typical turn-off transient response of the switched-capacitor
circuit studied. The top curve is $V_G(t)$ and the bottom curve is $v_d(t)$. The
lower linear part of the bottom curve, corresponding to the second phase of
switch turn-off, was used to extract the total capacitance between the gate pin
and the drain pin of the switch, including the true transistor gate-drain
overlap and the parasitic probe capacitances. The obtained value was 195 fF. It
was used as $C_{ol}$ in expression (8). Switch-induced error voltage was
measured against the gate voltage falling rate, ranging from
$4 \times 10^{3}$ V·s$^{-1}$ to $3.3 \times 10^{7}$ V·s$^{-1}$, for two
different signal voltage levels, 0 V and 0.2 V. Fig. 11 shows the measured data
and calculated results from the analytical model (expression (8)). Good
agreement is found in both cases.

VI.  Conclusion 

An analytical expression for the switch-induced error voltage on a switched
capacitor is presented. The expression plainly predicts the dependence of the
error voltage on gate voltage falling rate, signal voltage level, transistor
size, and process changes. For example, the error voltage increases with
increasing $V_S$ at low gate voltage falling rates and decreases with increasing
$V_S$ at high gate voltage falling rates. Computer simulations and experiments
affirm the validity of the analysis. The compact expression (8) should be
convenient in the analysis of switched-capacitor circuits.

Appendix A: Derivation of the Lumped Model from the Distributed Model
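Referring to Fig. 12,

$$\frac{\partial i}{\partial y} = -C'_{ox} W \frac{d[V_G - v(y)]}{dt} \approx -C'_{ox} W \frac{dV_G}{dt} \qquad \text{(A1)}$$

$$i(y) = i(0) - C'_{ox} W \frac{dV_G}{dt}\, y \qquad \text{(A2)}$$

$$i_d \equiv i(L) = i(0) - C'_{ox} W L \frac{dV_G}{dt} = i(0) - C_{ox} \frac{dV_G}{dt} \qquad \text{(A3)}$$

$$\frac{\partial v}{\partial y} = i(y) \frac{R_{ch}}{L} = i(0) \frac{R_{ch}}{L} - C_{ox} \frac{R_{ch}}{L^2} \frac{dV_G}{dt}\, y \qquad \text{(A4)}$$

$$v(y) = i(0) \frac{R_{ch}}{L}\, y - C_{ox} \frac{R_{ch}}{2L^2} \frac{dV_G}{dt}\, y^2$$

$$v_d \equiv v(L) = i(0) R_{ch} - \frac{C_{ox} R_{ch}}{2} \frac{dV_G}{dt} = \frac{i(0)}{G} - \frac{C_{ox}}{2G} \frac{dV_G}{dt} \qquad \text{(A5)}$$

Here C'_ox is the gate capacitance per unit area, C_ox = C'_ox W L is the total
gate capacitance, and R_ch = 1/G is the total channel resistance.

Eliminate i(0) from (A3) by using (A5):

$$i_d = v_d G - \frac{C_{ox}}{2} \frac{dV_G}{dt} \qquad \text{(A6)}$$

It is obvious that

$$C_L \frac{dv_d}{dt} = -i_d + C_{ol} \frac{dV_G}{dt} \qquad \text{(A7)}$$

Hence we obtain the desired expression.

$$C_L \frac{dv_d}{dt} = -G v_d + \left( C_{ol} + \frac{C_{ox}}{2} \right) \frac{dV_G}{dt} \qquad \text{(A8)}$$
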
Expression (A8) may be interpreted with an equivalent lumped circuit as shown in
Fig. 2(a). It is trivial to show that Fig. 12 reduces to Fig. 2(b) for V_G < V_S + V_T.
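
The factor of one half on the gate capacitance in (A5) and (A8) can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the linear current profile of (A2), integrates the voltage gradient of (A4) along the channel, and compares the result with the closed form of (A5). The function names and the sample values are hypothetical.

```python
def drain_voltage_distributed(i0, c_ox, r_ch, dvg_dt, segments=1000):
    """Integrate dv/dy = i(y) * R_ch / L along the channel (midpoint rule).

    The position is normalized as s = y / L, so the channel length drops out.
    i(y) follows the linear profile i(0) - C_ox * (y / L) * dV_G/dt.
    """
    ds = 1.0 / segments
    v = 0.0
    for j in range(segments):
        s = (j + 0.5) * ds
        current = i0 - c_ox * dvg_dt * s
        v += current * r_ch * ds
    return v


def drain_voltage_lumped(i0, c_ox, r_ch, dvg_dt):
    """Closed form: v_d = i(0) * R_ch - (C_ox * R_ch / 2) * dV_G/dt."""
    return i0 * r_ch - 0.5 * c_ox * r_ch * dvg_dt


# Hypothetical values: 1 uA source-end current, 100 fF total gate capacitance,
# 5 kohm channel resistance, gate falling at 1e7 V/s.
args = (1e-6, 100e-15, 5e3, -1e7)
print(drain_voltage_distributed(*args), drain_voltage_lumped(*args))
```

Both functions return the same value to within rounding, which is the statement that half of the total gate capacitance appears at the drain end of the lumped model.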

Appendix B: Derivation of the Analytical Model Including Source Resistance

Referring to Fig. 6 and expression (A8), Kirchhoff's current law at node A and
node B requires

$$C_L \frac{dv_d}{dt} = -i_d + \left( C_{ol} + \frac{C_{ox}}{2} \right) \frac{d(V_G - v_d)}{dt} \qquad \text{(B1)}$$


With the same assumptions as in section II, when the transistor is on, (B1) and
(B2) simplify to

$$C_L \frac{dv_d}{dt} = -\beta (V_{HT} - Ut)(v_d - v_s) - \left( C_{ol} + \frac{C_{ox}}{2} \right) U \qquad \text{(B3)}$$


^0(VnT-Ut)(vd-v.)  +  (  C„j  +  £=-)U 


Using (B4) to eliminate v_s from (B3), we obtain a first-order differential equation,

r  dvd  _  P(VHT-Ut)  .  .  Cox  _ 1 _  TT 

L  dt  l+0Rs(V,rr-Ut)  d  '  ol  2  ;|  1 +0RS(VH7-Ut)  ] 


Solving  this  differential  equation  and  including  the  clock  feedthrough  due  to 
gate-drain  overlap  capacitance  when  the  transistor  is  off,  we  get  the  complete 
solution, 


C„  fc"tir 

v«-«^<V,  +  V,-Vl)+U  — 


exp  - 


UClRs 


x  f  (^Rs(Vht~U ()  +  l] 


Cl/TRs-RjU  t  1 _ 

exp  CLRS  1  +  /SRs(Vht-UO 


(B6) 


TABLE I

NMOS SWITCH PARAMETERS

Parameter               Value
V_T (V_SB = 0.0 V)      0.70 V
V_T (V_SB = 0.2 V)      0.81 V
W_DRAWN                 40 um
L_DRAWN                 6 um
(label missing)         0.45 um
beta                    295 uA/V^2
t_ox                    85 nm

Figure Captions


Fig. 1. Schematic of the switch circuit under study.

Fig. 2. Equivalent lumped models for the circuit shown in Fig. 1.

(a) V_H >= V_G >= V_S + V_T (b) V_S + V_T >= V_G >= V_L. This equivalent
circuit is derived in Appendix A.

Fig. 3. The error voltage as a function of the gate voltage falling rate for four
signal voltage levels.

Fig. 4. The error voltage as a function of the signal voltage level for five
substrate dopings.

Fig. 5. The error voltage as a function of transistor channel length for four
channel widths.

Fig. 6. Schematic of the switch circuit with source resistor.

Fig. 7. Comparison of the analytic and computer simulated results of the
error voltage as a function of source resistance. Computer simulated
results with 0.5 pF/1 pF capacitance in parallel with Rs are also shown.

Fig. 8. Comparison of the analytic and computer simulated transient
responses for three different gate voltage falling rates.

Fig. 9. Comparison of the analytic and computer simulated results of the
error voltage as a function of the gate voltage falling rate.

Fig. 10. Turn-off transient response of the switched capacitor circuit shown in
Fig. 1. The top curve is the waveform applied to the gate. The bottom
curve is the error voltage waveform at the drain.

Fig. 11. Measured and calculated error voltages as functions of gate voltage
falling rate for two signal voltage levels.

Fig. 12. Distributed model for the circuit shown in Fig. 1.

SUBSTRATE POTENTIAL CALCULATION FOR LATCH-UP

MODELING

K. TERRILL and C. HU

Department  of  Electrical  Engineering  and  Computer 
Sciences 

ABSTRACT

The input and output circuits are the main triggering mechanisms
of latch-up in CMOS technology. We have studied the triggering
capability of these circuits and the effectiveness of using
guard-rings to suppress triggering. We present a method to estimate
the substrate potential induced by a triggering current in
these circuits and the effect of using guard-rings to prevent
latch-up. It was necessary to include effects of the field-threshold
implant to obtain good agreement between theory and measurements.


SUBSTRATE  POTENTIAL  CALCULATION  FOR  LATCH-UP 

MODELING 

K. TERRILL and C. HU

Department  of  Electrical  Engineering  and  Computer 
Sciences 

University  of  California,  Berkeley 


INTRODUCTION 

Bulk CMOS integrated circuits contain both parasitic vertical and lateral
bipolar transistors. These transistors form a pnpn structure (a generic SCR)
that can latch up if the positive feedback between the coupled transistors
produces regenerative switching [1]. It is possible to prevent latch-up by
suppressing those mechanisms which trigger the SCR into the turned-on state.
The main triggering mechanisms discussed in the literature are current injection
from the input and/or output circuitry. Either majority or minority carrier
current can be injected into the substrate from the I/O circuitry. In this study
we investigate the possible triggering due to majority carrier current and also
look at the effectiveness of using guard rings to prevent triggering for this case.
Although circuit models for the parasitic SCR have been presented [2,3], these
models give no means for calculating the substrate resistance. The method we
present allows the calculation of this resistance.

Fig. 1 shows a typical input protection circuit used in an N well technology [4].
When the input voltage is large and positive, the forward biased p+/n diode
injects minority carriers into the well, after which they are quickly swept into
the substrate and become majority carriers. In the substrate these carriers
generate a three-dimensional ohmic drop as they spread out. If the current is
high, the surface potential may become large enough to forward bias a nearby
diode. The minority carriers injected by this nearby diode that are collected by
a well may lead to latch-up. A grounded guard ring located near either the
injecter or the diode will help hold the surface potential at ground, thus
preventing the diode from becoming forward biased. It is also possible for the
drain of an output device to become forward biased inside the well. The forward
biasing of a drain p+/n junction will cause the same effect as described above.
Although we present the case of an N well technology, the above arguments also
hold for a P well technology with a simple sign reversal for voltage and current.
Another possible mechanism for substrate current is hot-electron generated
substrate current from n-MOS devices.

We use p+ surface contacts to emulate the injection of current into
the substrate from a forward biased p+/n junction inside the well. Measurements
on a CMOS N well technology wafer have shown that the surface potential
due to a forward biased diode inside the well does not differ by more than 15%
from the surface potential due to a surface contact when both are injecting the
same current. There was no significant variation in the surface potential when
the N well bias voltage was varied as long as the well/substrate junction was
reverse biased. During all measurements the backside was grounded. In what
follows we will refer to the ratio of the surface potential φ_s(r) to the
injecting current as the transverse resistance R_t at point r.

DISCUSSION 

A calculation of the surface potential due to a surface contact must take
into account the highly conductive field implantation layer on the silicon
surface. This layer typically is about one micron in depth and has a resistivity
which is at least an order of magnitude lower than the substrate resistivity. The
main effect of this low resistivity surface layer is the reduction in the spreading
resistance between a surface contact and the backside contact. If we assume
this low resistivity region has a uniform concentration, the boundary conditions
corresponding to this case are:

I  The  potential  at  the  surface  in  the  contact  region  is  constant. 

II The current component at and normal to the surface is zero in the
noncontacted region.

III  Both  the  potential  and  the  current  are  continuous  across  the  boundary 
separating  the  low  resistivity  surface  region  from  the  substrate. 

IV  The  potential  at  the  back  of  the  wafer  is  zero. 

If the region in which the surface potential is to be calculated is small
compared to the thickness of the wafer, then the fourth boundary condition can be
replaced with an infinite substrate with little variation in the calculated
potential. We have assumed an infinite substrate in all calculations that follow.
The wafers utilized have a thickness of ~325 micron and the results of our
calculations are indeed accurate for lateral distances less than this thickness.

It has been shown that the potential distribution due to a surface contact,
on a substrate composed of a thin, low-resistivity layer overlying a high
resistivity substrate, is not strongly influenced by the particular form of an
assumed injecting current distribution [5]. This is true as long as the
dimensions of the surface contact are not less than the depth of the conductive
layer. We will utilize this fact to find an approximate closed form solution for
the case of a circular contact.

RESULTS 

The resistance between a square surface contact of side-length L and the
backside is shown in Fig. 2 as a function of both field implant resistivity and
side-length. The resistance is calculated by using a circular disk of the same
area and assuming the current on the disk has the same distribution as in the
case of a disk on a homogeneous substrate [8]. By using this assumed current
distribution the surface potential can be found by imaging this current through
both the surface boundary and the boundary between the low resistivity layer and
the substrate. The field implant profile is assumed to be of uniform concentration
with a depth equal to 1 micron and the substrate resistivity is 21 ohm-cm. The
resistance between the surface contact and the backside is given by:


ρ_s is the substrate resistivity, ρ_f is the resistivity of the field implant
region and T_f is the depth of this region below the surface. The correlation
between the calculated and the measured resistance is as close as can be expected
since the actual field implant profile cannot be accurately determined. The
field resistivity used to fit the data differed by 11% from the average resistivity
SUPREM II predicted over this region, and the substrate resistivity was obtained
from a four point probe measurement.

Fig. 3 shows both the measured and calculated transverse resistance of a
surface contact as a function of the distance from the edge of the injecter "d"
and the injecter side-length "L". We define the transverse resistance at a
distance d to be the ratio of the surface potential at that distance to the
injecter current. Here the field resistivity is fixed at 1.20 ohm-cm. The
transverse resistance is given by:
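
$$R_t(d) = \frac{\rho_f}{2\pi a} \left\{ \sin^{-1}\!\left[ \frac{a}{a+d} \right] + 2 \sum_{n=1}^{\infty} k^n \sin^{-1}\!\left[ \frac{2a}{\sqrt{d^2 + (2nT_f)^2} + \sqrt{(d+2a)^2 + (2nT_f)^2}} \right] \right\} \qquad (2)$$

where a is the radius of the disk having the same area as the square contact and
k = (ρ_s - ρ_f)/(ρ_s + ρ_f).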

We  also  present  in  Fig.  3  the  solution  of  a  surface  contact  on  a  homogeneous 
substrate  with  L  equal  to  10  micron  for  comparison.  As  the  distance  d  increases 
the  transverse  resistance  asymptotically  approaches  the  value  predicted  for  a 
homogeneous  substrate. 
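
The image construction described above can be sketched numerically. This is a minimal illustration, not the authors' program: it assumes the equivalent-disk radius a = L/sqrt(pi), a reflection coefficient k = (ρ_s - ρ_f)/(ρ_s + ρ_f) at the layer boundary, images spaced 2*T_f apart, and the potential of a disk on a homogeneous half-space for each image. The function name, argument names, and the number of series terms are hypothetical choices.

```python
import math


def transverse_resistance(d_um, side_um, rho_f=1.2, rho_s=21.0,
                          t_f_um=1.0, n_terms=2000):
    """Transverse resistance (ohm) at distance d from the injecter edge.

    Lengths are in micron, resistivities in ohm-cm. The square contact is
    replaced by a disk of equal area and the layered substrate is handled
    by summing image contributions weighted by k**n.
    """
    a = side_um / math.sqrt(math.pi) * 1e-4   # equivalent disk radius, cm
    d = d_um * 1e-4
    t_f = t_f_um * 1e-4
    k = (rho_s - rho_f) / (rho_s + rho_f)

    def disk_term(z):
        # Potential shape of a disk source displaced by z from the surface.
        s = math.hypot(d, z) + math.hypot(d + 2.0 * a, z)
        return math.asin(2.0 * a / s)

    total = disk_term(0.0)
    for n in range(1, n_terms + 1):
        total += 2.0 * k ** n * disk_term(2.0 * n * t_f)
    return rho_f / (2.0 * math.pi * a) * total


# Transverse resistance for a 10 micron injecter at a few distances.
for d_um in (5.0, 20.0, 80.0):
    print(d_um, transverse_resistance(d_um, side_um=10.0))
```

Setting rho_f equal to rho_s makes k zero and recovers the homogeneous-substrate value, which gives a quick sanity check of the series; at large d the layered result approaches the homogeneous result evaluated with the substrate resistivity, in line with the asymptotic behavior noted above.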

Fig. 4 shows the transverse resistance and resistance to backside for a
surface contact which completely encloses a region in its interior. The
experimental geometry used is also shown in Fig. 4. The experimental geometry is a
square loop where each side is of length L (50 micron) and thickness T
(variable). In order to simplify the calculation the enclosing guard-ring is
assumed to be a circular ring of inner radius r_a and outermost radius r_b. Since
no analytical solution exists for this geometry it was necessary to solve for the
potential by using numerical methods. The potential was assumed constant over
the injecting region since this is the proper boundary condition for a surface
contact. The current density through a circular, enclosing guard-ring, which
has a constant potential V, can be formulated as an integral equation.

$$V = \int_{r_a}^{r_b} r'\,dr'\,J(r')\,G(r,r') \qquad (3)$$

Here r is restricted to the injecting region where φ(r) = V.

$$G(r,r') = \rho_f \int_0^{\infty} dq\, J_0(qr')\, J_0(qr) \left[ \frac{1 + k e^{-2qT_f}}{1 - k e^{-2qT_f}} \right] \qquad (4)$$

where k = (ρ_s - ρ_f)/(ρ_s + ρ_f).

The integral equation is solved numerically to obtain the current density
J(r') between r_a and r_b. After the current density is obtained the surface
potential can be computed:

$$\varphi(r) = \int_{r_a}^{r_b} r'\,dr'\,J(r')\,G(r,r') \qquad (5)$$

The  circular  ring  used  is  that  with  the  same  total  area  as  the  test  structure  and 
a  mean  radius  equal  to  (L+T)/2  .  It  is  important  that  ample  contacts  be  made  to 
the  guard-ring  structure  to  maintain  a  constant  potential  over  the  guard-ring. 
The  improvement  in  completely  enclosing  a  diode  is  clearly  displayed  here 
where  the  surface  potential  drops  much  less  rapidly  inside  the  guard-ring  than 
outside  the  guard-ring.  Although  all  the  data  is  given  for  a  substrate  resistivity 
of  21  ohm-cm  the  values  of  the  resistances  can  be  scaled  with  the  substrate 
resistivity  as  long  as  the  ratio  between  the  field  resistivity  and  the  substrate 
resistivity  is  held  fixed. 

The  solution  for  an  injecter  in  the  presence  of  a  guard-ring  can  be  found 
once  the  surface  contact  to  backside  resistance  and  the  transverse  resistances 
are  determined.  Before  showing  this  we  must  first  define  some  terms  that  are 
necessary  in  our  formulation. 

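The terms are defined as follows:

$$R_{di} = \frac{V_d}{I_i} \quad \text{with } I_g = 0 \text{ and } I_d = 0$$

$$R_{dg} = \frac{V_d}{I_g} \quad \text{with } I_i = 0 \text{ and } I_d = 0$$

$$R_{ig} = \frac{V_i}{I_g} \quad \text{with } I_i = 0 \text{ and } I_d = 0$$

$$R_{gi} = \frac{V_g}{I_i} \quad \text{with } I_g = 0 \text{ and } I_d = 0$$

$$R_g = \frac{V_g}{I_g} \quad \text{with } I_i = 0 \text{ and } I_d = 0$$
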

where "i", "g" and "d" refer to the injecter, the guard ring and the nearby
diode respectively. R_di, R_dg, etc. are transverse resistances. R_g is the
resistance between the guard-ring and the backside. All the terms except the last
may be calculated by equation (2) or (5). It is necessary when doing so to
average the potential given by these equations over the region of interest. The
last term, the resistance between the guard-ring and the backside, may be
calculated using equation (1) or (3).

Since the current injected into the substrate by the forward biased diode
inside the well is a constant current (independent of substrate potential), a
solution can be found by simple superposition of the potential due to the
injecting source and the potential due to the guard-ring. If the guard-ring is
grounded then the guard-ring current must be:

$$I_g = \frac{R_{gi}}{R_g}\, I_i \qquad (6)$$

It is important to note that the substrate current is equal to the difference
between the injected current and the guard-ring current. This current path
cannot be neglected unless R_gi is almost equal to R_g. From Figs. 2 and 3 it
is easy to see that this condition does not exist since the transverse resistance
drops rapidly as the distance from the injecter edge increases.

The solution for an injecter in the presence of a guard-ring is given by:

$$\frac{V_d}{I_i} = R_{di} - \frac{R_{dg} R_{gi}}{R_g} \qquad (7)$$



From reciprocity considerations it is easy to show that R_ig = R_gi, etc. This
shows that the same transverse resistance is obtained if the injecter and the
nearby diode are switched so that i=>d and d=>i in the above equations. When
calculating R_di and R_gi we did not account for the constant potential,
guard-ring surface contact. Although this contact is floating it still has some
effect on the potential. In neglecting this floating guard-ring we introduce some
error in our calculation.

As an example, in Fig. 5 we show the effective substrate resistance (V_d/I_i)
due to an injecter of side-length L = 12 micron which has a nearby guard-ring of
side-length L = 12 micron located at a distance d = 18 micron from the edge of the
injecter. We present two cases of interest. In the first case the guard-ring is
located between the injecter and the diode, while in the second case the
guard-ring is placed on the opposite side of the injecter. For both cases
R_g = 2.75 K ohm and R_gi = 0.971 K ohm using equations (1) and (2) respectively.
The transverse resistance R_gi was averaged over the region of interest. The
guard-ring current is thus equal to 35.3% of the injected current. The remaining
current passes through the backside contact. We also show the resistance
due to an injecter with no guard-ring present for comparison as a broken line.
As expected the placement of the guard-ring has a major impact on its ability to
protect a nearby diode. When the guard-ring is placed between the injecter and
the diode its effectiveness is improved, so that the substrate resistance is
reduced considerably.
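
The superposition argument can be written out as a short calculation. This is a sketch under stated assumptions rather than the authors' code: the grounded guard-ring is taken to draw the fraction R_gi/R_g of the injected current, and the potential at the diode is the injecter contribution minus the guard-ring contribution. The function names are hypothetical, and the values passed for R_di and R_dg are placeholders, since the text quotes only R_g and R_gi.

```python
def guard_ring_current_fraction(r_gi, r_g):
    """Fraction of the injected current drawn by a grounded guard-ring."""
    return r_gi / r_g


def effective_substrate_resistance(r_di, r_dg, r_gi, r_g):
    """V_d / I_i with the guard-ring held at ground (superposition)."""
    return r_di - r_dg * guard_ring_current_fraction(r_gi, r_g)


# Values quoted in the text: R_g = 2.75 kohm, R_gi = 0.971 kohm.
fraction = guard_ring_current_fraction(r_gi=971.0, r_g=2750.0)
print(f"guard-ring current fraction: {fraction:.3f}")  # 0.353

# Placeholder transverse resistances (ohm), for illustration only.
print(effective_substrate_resistance(r_di=400.0, r_dg=300.0,
                                     r_gi=971.0, r_g=2750.0))
```

The first result reproduces the 35.3% quoted above; the remainder of the injected current flows to the backside contact.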

CONCLUSIONS 

We have presented a means to calculate the substrate potential induced by
a triggering current, as in typical CMOS I/O circuitry. This method may also be
used to find the substrate resistance in internal latch-up structures. The
effectiveness of using a guard-ring to suppress triggering can be computed by
using superposition. In order to obtain an accurate prediction it was necessary
to account for the effect of the field-threshold implant. It is important to include
the current path to the backside when estimating the substrate resistance in
latch-up structures.

Figure Captions

Fig. 1. An input protection circuit used with an input inverter and its
cross-section showing the p+/n and n+/p diodes.

Fig. 2. The resistance between a square surface contact of side-length L and the
backside as a function of both the resistivity of the field implant region
"ρ_f" and side-length. The substrate resistivity is 21 ohm-cm.

Fig. 3. The transverse resistance of a surface contact as a function of the
distance from the edge of the injecter "d" and the injecter side-length "L". The
broken line shows the transverse resistance for a homogeneous substrate
with L = 10 micron. The substrate and field implant region resistivities are
21 and 1.2 ohm-cm respectively.

Fig. 4. The transverse resistance and the resistance to backside for a surface
contact which completely encloses a region in its interior. The resistance is
shown as a function of the thickness of the ring "T" and the distance from
its center "d". The substrate and field implant region resistivities are 21
and 1.2 ohm-cm respectively.

Fig. 5. The effective substrate resistance due to an injecter of side-length
L = 12 micron which has a grounded guard-ring of side-length L = 12 micron
located at a distance of 18 micron from the edge of the injecter. The
resistance is shown as a function of the distance d from the injecter edge. The
resistance due to the same injecter with no guard-ring present is shown by
the broken line. The substrate and field implant region resistivities are 21
and 1.2 ohm-cm respectively.

Abstract: We present an analysis of the collection of alpha-particle generated charge by collectors
surrounded by either uniform reflecting or uniform absorbing surfaces. These are the two extreme cases
of any real condition that exists in ICs. The analysis of the upper limit of charge collection should be
more useful for circuit design than the previously available lower limit. It is assumed that the charge
transport is by diffusion. The effects of collector size, α-particle energy, and the separation between the
collector and the alpha track are studied. When the α-particle strike is through the center of the collector,
the difference in collected charge for the two cases is up to a factor of two. When the α-particle strike
does not pass through the collector, the difference is much greater. The collected charge scales
approximately linearly with the collector side length.


1. INTRODUCTION

The discovery by May and Woods [1] that low levels
of α-particle radiation originating from packaging
materials can cause soft errors in MOS dynamic
RAMs has created considerable interest in the
simulation of the transport and collection of carriers
generated along an alpha-particle track. The soft
error rate depends on the frequency of α-particle
strikes, the physics of charge generation, the
collection process and the circuit vulnerability. This
paper is mainly concerned with the collection process
and its sensitivity to the boundary conditions which
exist at the surface of the silicon chip.

Kirkpatrick has presented a thorough analysis of
the collection phenomenon assuming a diffusion
transport mechanism with a collector surrounded by
a uniform absorbing surface [2]. This boundary
condition can represent the surface under a depletion
region which may be assumed to be absorbing
(having infinite recombination velocity). Kirkpatrick was
able to obtain an analytic solution since this
boundary condition resulted in a uniform surface boundary
condition. However, it is unknown how much more
charge will be collected when the surrounding surface
area is under the field oxide and may be assumed to
be reflecting (having zero recombination velocity).
For the purpose of circuit design, it is more important
to know this upper bound of charge collection than
the lower bound that Kirkpatrick derived.

The difficulty that occurs when one tries to account
for both absorbing and reflecting surfaces is due to
the mixed boundary conditions which now exist at
the surface. These boundary conditions make it
impossible to obtain an analytic solution to the diffusion
equation. One attempt at accounting for the presence
of both absorbing and reflecting surfaces relied on
Monte Carlo simulations [3, 4].


This paper presents a solution to the mixed
boundary problem of a collector surrounded by a uniform
reflecting surface and compares it to the results of
Kirkpatrick's boundary condition as the two extreme
cases of any real condition in ICs. Another difference
between Kirkpatrick's work and the present paper is
that the former considered circular and line collectors
while the present paper presents results for square
collectors.

Recent work by Hsieh et al. [5, 6] has shown that
for the case of an α-particle passing through a
depletion region the dominant transport mechanism
may be drift rather than diffusion for a short period
of time after the α-particle strikes. In this case the
present paper may still be able to help estimate the
charge collected due to diffusion after this time period
has elapsed [7].

2. MODEL

2.1 Initial conditions and boundary conditions

The rate at which an energetic α-particle loses
energy as it travels in silicon has been calculated by
Ziegler. His findings show that α-particles of the same
energy always take the same distance to stop and the
rate of energy loss through ionization is a function of
only the distance from the end of the track. The
generated carriers thermalize in less than one
picosecond to a distance of 0.1 µm around the α-particle
path. The initial condition resulting from an
α-particle strike in silicon is thus a line of
electron-hole pairs. The pair density increases and eventually
peaks and drops to zero as the α-particle slows down
toward the end of the track.

There are two types of surfaces for the silicon. The
surface under the field oxide can usually be
considered a reflective boundary owing to the high quality
of the SiO2-Si interface and the heavy field doping
which produces a field opposing the flow of electrons
toward the surface. The surface under the depletion
region can be considered to be an absorbing
boundary since the electric field there sweeps carriers away
at a faster rate than they can diffuse to the surface.
In the following discussions we will neglect the finite
depth of the depletion region in our calculations. This
reduces the diffusion problem to a planar geometry
with mixed boundary conditions on the surface. We
will present solutions for both a collector surrounded
by a reflecting surface and a collector surrounded by
an absorbing surface as the two extreme conditions
which may exist in an IC.

For a rectangular collector surrounded by an
absorbing surface the solution may be obtained easily
from the method of images. The boundary condition
for this case is that the concentration of carriers at the
surface is zero over the entire surface [2].

For a rectangular collector surrounded by a
reflecting surface we have solved the diffusion
equation numerically. The boundary conditions
corresponding to this case are:

(a) The concentration of carriers at the surface in
the collector region is zero.

(b) The gradient of the carrier concentration at
and normal to the surface is zero in the reflecting
region.

2.2  Mathematical  formulation 

By restricting the transport mechanism to diffusion,
the flux through a collector surrounded by a
reflecting surface after an α-strike can be found by
solving the diffusion eqn (1)

$$D\nabla^2 n - \frac{\partial n}{\partial t} - \frac{n}{\tau} = 0 \qquad (1)$$

where n is the concentration of carriers at any
position r and any time t, D is the diffusion constant,
and τ is the recombination time in the substrate. By
making the substitution n = N exp(-t/τ) the
recombination term drops out of the diffusion equation,
yielding

$$D\nabla^2 N - \frac{\partial N}{\partial t} = 0. \qquad (2)$$
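
The step from eqn (1) to eqn (2) can be written out explicitly. The following is a short worked derivation, assuming the diffusion equation with a linear recombination term and writing the carrier concentration as n and the transformed variable as N (symbol choices made here for illustration):

```latex
\begin{align*}
n(\mathbf{r},t) &= N(\mathbf{r},t)\,e^{-t/\tau}
  && \text{(substitution)} \\
\frac{\partial n}{\partial t}
  &= \left(\frac{\partial N}{\partial t} - \frac{N}{\tau}\right) e^{-t/\tau},
  \qquad \nabla^2 n = e^{-t/\tau}\,\nabla^2 N \\
D\nabla^2 n - \frac{\partial n}{\partial t} - \frac{n}{\tau}
  &= e^{-t/\tau}\left(D\nabla^2 N - \frac{\partial N}{\partial t}
     + \frac{N}{\tau} - \frac{N}{\tau}\right)
   = e^{-t/\tau}\left(D\nabla^2 N - \frac{\partial N}{\partial t}\right) = 0
\end{align*}
```

Since the exponential factor never vanishes, N obeys the recombination-free diffusion equation, and n is recovered from N by multiplying by exp(-t/τ).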

In practice, the collection process ends in tens of
nanoseconds, a time much shorter than the
recombination time. As a result the difference between
n and N will be ignored, i.e. recombination will be
neglected, in this paper. It would be straightforward
to include the effect of recombination, such as might
be desirable for GaAs; then the collected charge will
be dependent on τ. The corresponding Green's
function is found by solving the differential eqn (3)

$$D\nabla^2 G(\mathbf{r},\mathbf{r}',t-t') - \frac{\partial G(\mathbf{r},\mathbf{r}',t-t')}{\partial t} = -\delta(\mathbf{r}-\mathbf{r}')\,\delta(t-t') \qquad (3)$$



where G is a Green's function with an observation
point r, source location r', and boundary conditions
that have not yet been specified. The solution of the
diffusion equation is given quite generally by the
integral eqn (4)

$$N(\mathbf{r},t) = D\int_0^t dt' \oint d\sigma' \left[ G(\mathbf{r},\mathbf{r}',t-t')\frac{\partial N(\mathbf{r}',t')}{\partial n'} - N(\mathbf{r}',t')\frac{\partial G(\mathbf{r},\mathbf{r}',t-t')}{\partial n'} \right] + \int d^3r'\, G(\mathbf{r},\mathbf{r}',t)\,N(\mathbf{r}',0). \qquad (4)$$

Here N(r',0) is the initial excess carrier
concentration generated by the α-strike. Any initial excess
carrier concentration can be used, but for an α-strike
we will let the initial excess carrier concentration be
a line of charge as described by Ziegler [8].

In order to solve the integral equation we must
apply the boundary conditions specified and make an
appropriate choice of a Green's function which
allows us to solve the equation. It is convenient to
choose a Green's function which has a zero gradient
normal to the surface as this allows us to simplify the
problem to finding N only on the collecting surface.
This is accomplished quite easily through the method
of images. It is clear that the problem of finding such
a Green's function is equivalent to the problem of the
original carrier concentration and an equal carrier
concentration located at the mirror-image point
above the plane defined by the position of the surface.
The Green's function used is given in eqn (5).

$$G(\mathbf{r},\mathbf{r}',t-t') = \frac{1}{[4\pi D(t-t')]^{3/2}} \left\{ \exp\!\left[ -\frac{|\mathbf{r}-\mathbf{r}'|^2}{4D(t-t')} \right] + \exp\!\left[ -\frac{|\mathbf{r}-\mathbf{r}'_i|^2}{4D(t-t')} \right] \right\} \qquad (5)$$

where r'_i is the mirror image of the source point r' in the plane of the surface.
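
The image construction for both boundary types can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions, not the authors' implementation: the silicon occupies z > 0 with the surface at z = 0, the free-space diffusion kernel is the usual Gaussian, an equal image source gives a reflecting surface, and an opposite-sign image gives an absorbing one. The function names and the diffusivity value are hypothetical.

```python
import math


def green_free(dx, dy, dz, elapsed, diffusivity):
    """Free-space diffusion kernel for a unit point source."""
    four_dt = 4.0 * diffusivity * elapsed
    r_squared = dx * dx + dy * dy + dz * dz
    return math.exp(-r_squared / four_dt) / (math.pi * four_dt) ** 1.5


def green_half_space(obs, src, elapsed, diffusivity, reflecting=True):
    """Half-space kernel built from the source and its mirror image in z = 0."""
    x, y, z = obs
    xs, ys, zs = src
    direct = green_free(x - xs, y - ys, z - zs, elapsed, diffusivity)
    image = green_free(x - xs, y - ys, z + zs, elapsed, diffusivity)
    return direct + image if reflecting else direct - image


# Assumed units: micron and nanosecond; D = 2.5 um^2/ns is a placeholder value.
D = 2.5
source = (0.0, 0.0, 5.0)
h = 1e-3

# Reflecting surface: the kernel is even in z, so its normal gradient vanishes.
above = green_half_space((1.0, 0.0, h), source, 1.0, D)
below = green_half_space((1.0, 0.0, -h), source, 1.0, D)
print("normal gradient at surface:", (above - below) / (2.0 * h))

# Absorbing surface: the kernel itself vanishes on z = 0.
print("absorbing kernel at surface:",
      green_half_space((1.0, 0.0, 0.0), source, 1.0, D, reflecting=False))
```

The reflecting kernel is the one used to reduce the integral equation to the collector area, while the absorbing kernel corresponds to the uniform absorbing surface treated analytically.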

The surface integral can be broken up into three
distinct regions.

(a) The surface area located at the surface of the
silicon in the collector region.

(b) The surface area located at the surface of the
silicon in the reflecting region.

(c) The surface which encloses the hemisphere of
infinite radius.

The surface integral over the hemispherical surface at
infinity can be shown to be zero. Thus, the time integral
will be zero for any finite time. If bulk recombination is
included then the time integral is zero even for an infinite
time. Over the remaining surfaces the gradient of the Green's
function is zero. This allows us to reduce the integral
equation by setting the dipole term equal to zero.
Substituting this into eqn (B) yields:


$$ N(\mathbf{r}, t) = D \int_0^t dt' \int d\sigma'\, G(\mathbf{r}, \mathbf{r}', t - t')\, \frac{\partial N(\mathbf{r}', t')}{\partial n'} + \int d^3r'\, G(\mathbf{r}, \mathbf{r}', t)\, N(\mathbf{r}', 0). \qquad (6) $$


By  applying  boundary  condition  (b)  the  integral 
over  the  reflecting  surface  is  seen  to  be  zero.  Thus  the 
only  contributing  portion  of  the  integral  is  the  inte¬ 
gration  over  the  collector  region.  This  reduces  the 
integral  equation  to: 

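$$ N(\mathbf{r}, t) = D \int_0^t dt' \int_S d\sigma'\, G(\mathbf{r}, \mathbf{r}', t - t')\, \frac{\partial N(\mathbf{r}', t')}{\partial n'} + \int d^3r'\, G(\mathbf{r}, \mathbf{r}', t)\, N(\mathbf{r}', 0), \qquad (7) $$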

where  S  represents  the  collector  surface  (region  I). 

Finally, to form an integral equation we need to restrict
the observation point r to lie on the surface in the
collector region. In the limit of approaching this boundary
the concentration N(r, t) must go to zero and the integral
equation simplifies to:

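$$ 0 = D \int_0^t dt' \int_S d\sigma'\, G(\mathbf{r}, \mathbf{r}', t - t')\, \frac{\partial N(\mathbf{r}', t')}{\partial n'} + \int d^3r'\, G(\mathbf{r}, \mathbf{r}', t)\, N(\mathbf{r}', 0). \qquad (8) $$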

We  are  interested  in  obtaining  the  flux  of  carriers 
passing  through  the  surface  in  the  collector  region. 
This  flux  is  related  to  the  carrier  concentration  by  the 
relation: 


Thus,  the  flux  of  carriers  through  the  collecting 
surface  can  be  found  from  the  numerical  solution  of 
the  integral  equation  over  this  area. 
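For reference, Fick's first law gives the usual form of a relation between flux and concentration. The sketch below assumes n' is the outward normal to the silicon surface and introduces F(r', t) for the carrier flux leaving through the collector and Q(t) for the charge collected up to time t. These symbols and the sign convention are assumptions made here.

```latex
F(\mathbf{r}', t) = -D\,\frac{\partial N(\mathbf{r}', t)}{\partial n'},
\qquad \mathbf{r}' \in S,
\qquad
Q(t) = q \int_0^{t} dt' \int_S d\sigma'\, F(\mathbf{r}', t')
```

Here q is the electronic charge.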

3. RESULTS

In the following subsections the total collected charge is
presented for three different types of α-particle strikes.
The first case occurs when an α-particle passes directly
through the center of the collector. This case will be
referred to as a “center hit”. The “center hit” case has been
published before, but it is briefly covered again here for
completeness[9]. The second case occurs when an α-particle
passes through the edge of the collector. This case will be
referred to as an “edge hit”. The last case occurs when an
α-particle passes through the reflecting surface surrounding
the collector. This will be referred to as a “near miss”. The
results given are for a square collector with alpha strikes
occurring normal to the surface. The figures to be presented
give solutions for both a collector surrounded by a
reflecting surface as an upper bound (solid line) and a
collector surrounded by an absorbing surface as a lower bound
(broken line). In both cases the diffusion constant used is

D = 25 cm²/s.


The essential difference between the two boundary conditions
can be appreciated from the following two examples. Figure 1
shows the charge flux collected by a 5 × 5 μm collector from
a point source located 5 μm below the surface. The boundary
condition has negligible effect on the flux in the initial
period. At times long enough for “reflected” carriers to be
collected, the reflecting surface results in a higher flux
than the absorbing surface. Figure 2 shows the collected
charge as a function of location inside the collector area at
several time instances. For the absorbing surface the plot
shows that the charge always peaks in the center of the
collector and is always a minimum at the edge. In the case of
a reflecting surface the charge distribution has a local
maximum at the edge of the collector which may be larger than
the charge collected at the center. Clearly the larger amount
of charge collected near the edge of the collector is
contributed by carriers reflected from the reflecting surface.

Fig. 1. The charge flux collected by a 5 × 5 μm collector
from a point source located at the center of the collector
and 5 μm below the surface. In this and all other figures the
reflecting surface is represented by a solid line while the
absorbing surface is represented by a dashed line.

3.1  Center  hit 

We are interested in finding the collected charge which
results from an α-particle passing through the center of the
collector. We split the problem into two parts in order to
reduce the required computation time. First we find the
impulse response of the system for a point source located at
a distance Z below the collecting surface. The results of
these calculations are shown in Fig. 3. It is important to
note that for a center hit the total charge collected by the
square collector from a point source is a function of only
the ratio of the depth of the source Z and the side length of
the collector W. This allows us to calculate the charge
collected by collectors of many different sizes from the same
impulse response. Finally, to calculate the collected charge
for a given carrier distribution along the α-particle track
we only need to convolve the given distribution[8] with the
impulse response shown in Fig. 3.

Fig. 2. The collected charge as a function of location inside
the collector area at several points in time. The size of the
collector is 5 × 5 μm and the point source is located at the
center of the collector and 7.5 μm below the surface.
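The two-step procedure (impulse response first, then integration along the track) can be sketched for the absorbing-surface lower bound, where a closed form exists. With recombination neglected, the fraction of carriers from a point source that eventually reaches a patch of a fully absorbing plane equals the solid angle the patch subtends at the source divided by 2π. This is a standard result for diffusion to an absorbing plane. It is not the numerical method used for the figures, and it does not cover the reflecting case. The function names and the uniform line density are assumptions for illustration; the track distribution of Ziegler[8] is not uniform.

```python
import math

def center_hit_fraction(z_over_w):
    """Fraction of a point source's carriers collected by a square of side W
    (absorbing surround, no recombination) when the source lies a depth Z
    below the collector center. Depends only on the ratio Z/W."""
    if z_over_w <= 0.0:
        return 1.0  # source sits on the collector itself
    a = 0.5  # half side length, in units of W
    # Solid angle of the centered square = 4 x solid angle of one quadrant.
    quadrant = math.atan(a * a / (z_over_w * math.sqrt(2.0 * a * a + z_over_w ** 2)))
    return 4.0 * quadrant / (2.0 * math.pi)

def collected_from_track(track_length, w, line_density=1.0, steps=2000):
    """Integrate (line density x impulse response) along a normal-incidence
    track with the midpoint rule. A uniform line_density is a placeholder."""
    dz = track_length / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * dz
        total += line_density * center_hit_fraction(z / w) * dz
    return total
```

Since the impulse response depends only on Z/W, scaling the track length and the collector side by the same factor scales the collected charge by that factor, so one tabulated response serves every collector size.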

The total charge collected as a function of alpha energy for
a square collector with side lengths of 2.5, 5.0, 7.5 and
10.0 μm is shown in Fig. 4. As would be expected, the effect
of the reflecting surface is to increase the charge
collected. For α-particles striking in the center of the
collector it is at most an increase of 40% for 3 MeV
α-particle energies and lower. For higher α-particle energies
or for W < 2.5 μm the increase can be as much as 80%.

The total charge collected as a function of collector size is
shown in Fig. 5. As noted by Kirkpatrick [2] the collected
charge scales linearly with the radius for the absorbing
surface. For the reflecting surface the result is similar
except that for low α-particle energies and when the length
of the side is larger than approximately three microns the
collected charge decreases slightly less rapidly than a
linear curve would predict. In contrast the charge stored on
the storage capacitor of a dynamic RAM may scale with the
second power of feature length. Figure 5 also shows that the
reflecting surface results in about 60% more charge collected
than the absorbing surface for W < 3 μm.
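The consequence of these two scaling laws can be written out explicitly. As a sketch, assume ideal power laws with proportionality constants a and b (symbols introduced here, not read from the figures):

```latex
Q_{\mathrm{coll}} \approx a\,W, \qquad Q_{\mathrm{stored}} \approx b\,W^{2}
\quad\Longrightarrow\quad
\frac{Q_{\mathrm{coll}}}{Q_{\mathrm{stored}}} \approx \frac{a}{b}\cdot\frac{1}{W}
```

Under these assumptions, halving the feature length doubles the ratio of collected charge to stored charge.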

3.2  Edge  hit 

The impulse response for an initial point source of carriers
lying a distance Z below the edge of the collector is given
in Fig. 6. Once again it is important to note that the
collected charge is only a function of the ratio of the
distance from the surface to the source and the side length
of the collector.

Fig. 4. Total charge collected for a center hit as a function
of α-particle energy for α-particles striking the center of
collectors with side lengths of 2.5, 5.0, 7.5 and 10.0 μm.

Fig. 5. Total charge collected as a function of collector
side length for α-particles of 3 and 7 MeV striking the
center of the collector.

The total collected charge as a function of alpha
energy for a square collector with side lengths of 2.5,
5.0, 7.5 and 10.0 μm is shown in Fig. 7. As would be
expected  the  reflecting  surface  has  a  greater  effect  on 
the  collected  charge  in  the  case  of  an  edge  hit  than  in 
the  case  of  a  center  hit.  The  collected  charge  for  a 
square  collector  of  side  length  2W  surrounded  by  an 
absorbing  surface  is  by  coincidence  about  equal  to 
the  collected  charge  for  a  square  collector  of  side 
length  W  surrounded  by  a  reflecting  surface. 

3.3  Near  miss 

Figure 8 shows the locations of these strikes in comparison
to the direct hit and edge hit. It is important to note in
this case that the collected charge is not only a function of
the ratio of the depth of the source and the side length of
the collector, Z/W. We must also specify the distance X from
the edge of the collector to the point where the source
projects onto the surface. This dependence also scales so
that the collected charge may be expressed in terms of the
ratios X/W and Z/W. Figure 9 shows the impulse responses for
near misses for X/W = 0.5, 1 and 1.5. We note that for a
collector surrounded by an absorbing surface the collection
efficiency peaks at some distance away from the surface and
that this distance increases as we move away from the
collector. This is due to the fact that carriers generated
very close to the surface have a high probability of being
absorbed by the absorbing surface before reaching the
collector. For the case of a collector surrounded by a
reflecting surface the maximum collection efficiency occurs
at the surface. Thus, the two solutions diverge from each
other as they approach the surface.

Fig. 7. Total charge collected for an edge hit as a function
of α-particle energy for α-particles striking the edge of
collectors with side lengths of 2.5, 5.0, 7.5 and 10.0 μm.
The reflecting surface is represented by a solid line while
the absorbing surface is represented by a dashed line.

Fig. 8. Location of center hit, edge hit and near miss in
relation to the collector.
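The behaviour of the absorbing-surface curves can be reproduced with a closed-form sketch. It rests on a standard result: with recombination neglected, the fraction of carriers from a point source that reaches a patch of a fully absorbing plane is the solid angle subtended by the patch divided by 2π. The geometry is an assumption made here: the source projects onto the surface at a distance X outside the midpoint of one collector edge. The function names are hypothetical, and the reflecting case has no such closed form.

```python
import math

def _corner_solid_angle(a, b, h):
    """Signed solid angle of the rectangle [0, a] x [0, b] seen from a point
    at height h above its corner at the origin (odd in a and in b)."""
    return math.atan(a * b / (h * math.sqrt(a * a + b * b + h * h)))

def absorbing_fraction(x_over_w, z_over_w):
    """Collected fraction for a unit-square collector in an absorbing plane.
    X = 0 corresponds to a hit on the midpoint of the collector edge."""
    h = z_over_w
    # Source at the origin; collector spans x in [X, X + 1], y in [-1/2, 1/2]
    # (all lengths in units of W). Combine corner terms by inclusion-exclusion.
    x1, x2 = x_over_w, x_over_w + 1.0
    y1, y2 = -0.5, 0.5
    omega = (_corner_solid_angle(x2, y2, h) - _corner_solid_angle(x1, y2, h)
             - _corner_solid_angle(x2, y1, h) + _corner_solid_angle(x1, y1, h))
    return omega / (2.0 * math.pi)

def peak_depth(x_over_w, z_max=5.0, steps=5000):
    """Depth Z/W, found by grid search, where the collected fraction peaks."""
    depths = [z_max * (i + 1) / steps for i in range(steps)]
    return max(depths, key=lambda z: absorbing_fraction(x_over_w, z))
```

Evaluating `peak_depth` for X/W = 0.5, 1 and 1.5 shows the peak moving deeper as the strike moves away from the collector, while `absorbing_fraction` tends to zero for a near-miss source close to the surface.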

Figure 10 compares the charge collected for a center hit, an
edge hit and a near miss as a function of α-particle energy
using a square collector of side length 5 μm. We can see in
this figure that as the strike location moves away from the
collector the collection efficiency decays rapidly for the
case of a collector surrounded by an absorbing surface. In
comparison, for a collector surrounded by a reflecting
surface the collection efficiency decays much more slowly. It
is also important to note that as the α-particle strike moves
further from the center of the collector the peak in the
collection efficiency occurs at higher α-particle energies.


Fig. 10. Total charge collected for a center hit, an edge hit
and a near miss as a function of α-particle energy using a
square collector of side length 5 μm. The reflecting surface
is represented by a solid line while the absorbing surface is
represented by a dashed line.
Figure 11 compares the charge collected for a center hit,
edge hit and near miss as a function of collector side length
on a log-log scale for an α energy of 5 MeV. For the near
misses the distance X is also scaled so that the ratio X/W
remains constant. At small side lengths the charge collected
for all cases is approximately linear. Figure 11 also shows
that about 7 times more charge is collected with a reflecting
surface than with an absorbing surface when X/W = 1.0.

4.  CONCLUSION  AND  DISCUSSION 

A  numerical  solution  of  the  diffusion  equation  for 
the  case  of  a  collector  surrounded  by  a  reflecting 
surface  has  been  presented.  For  a  “center  hit”  the 
results  show  that  the  charge  collected  for  the  case  of 
a  collector  surrounded  by  a  reflecting  surface  is  up  to 
two  times  that  for  the  case  of  one  surrounded  by  an 
absorbing  surface. 

For an “edge hit” the difference in collected charge between
a collector surrounded by an absorbing surface and one
surrounded by a reflecting surface is more pronounced. A rule
of thumb for this case is that the collected charge for a
square collector of side length 2W surrounded by an absorbing
surface is about equal to the collected charge for one of
side length W surrounded by a reflecting surface.

In the case of a near miss the collection efficiency for a
collector surrounded by a reflecting surface is shown to
decrease much more slowly than it does for a collector
surrounded by an absorbing region as the strike location
moves away from the collector. Compared to an absorbing
surface, a reflecting surface results in about 7 times more
collected charge when an α-particle strikes at one collector
length away from the collector edge.

The results presented here do not depend on the recombination
lifetime because recombination has been neglected. This is a
good approximation for the collection of α-generated charge
in silicon. The analysis can easily include recombination, as
might be desirable for GaAs.

While diffusion based analysis is probably incomplete for the
case of a center hit because of the funneling
phenomenon[5, 6], it is still useful. Sai-Halasz[3, 4] found
good agreement with experiment by assuming that charge
generated outside the funneling length is collected by
diffusion. Hu[7] suggested that all the carriers along the
track drift toward the junction by a funneling length during
the funneling period, so that the problem of diffusion
collection after the funneling period is equivalent to that
solved in this paper for a lower-energy α-particle. Thus, the
diffusion based model, which is complete when the α-particle
strike does not pass through the junction depletion region,
can also be useful for diffusion collection in this case when
properly adjusted for the new initial conditions after the
funneling phenomenon.